
Open Source LLMs in 2026: The Definitive Comparison

LDS Team
Let's Data Science

At the start of 2025, a Chinese AI lab released a reasoning model under an MIT license that matched OpenAI's best models on math benchmarks. It cost less than $6 million to train. Nvidia lost $589 billion in market capitalization in a single day. The message was clear: open source LLMs had arrived, and the rules of the game had changed permanently. If you're building a product in 2026 — an AI coding assistant, a customer-facing chatbot, a document intelligence pipeline — you no longer have to default to a proprietary API. The question is which open model to choose, and for which job.

This guide answers that question with current specs, verified benchmarks, and a practical framework for a startup evaluating open source LLMs for a production coding assistant.

Why open source models matter in production

Open source LLMs solve problems that API-first approaches can't. When you self-host a model, you don't send customer code through a third-party server. That matters deeply for companies subject to HIPAA, SOC 2, GDPR, or financial regulations. More practically, it matters for any company building a coding assistant: you'd be sending proprietary source code to an external provider on every inference call.

Beyond privacy, the economics are compelling. A startup processing 10 million tokens per day through GPT-5 would pay approximately $620,000 annually. Self-hosting Llama 3.3 70B on two H100 GPUs costs roughly $52,000 a year — a reduction of more than 90%. At scale, even a mid-size cloud GPU bill beats premium API pricing by a factor of 5 to 10.
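The arithmetic behind those figures can be sketched in a few lines. The blended API price and GPU hourly rate below are illustrative assumptions chosen to reproduce the article's round numbers, not vendor quotes:

```python
def api_annual_cost(tokens_per_day: float, price_per_million: float) -> float:
    """Annual API spend at a blended $/1M-token rate."""
    return tokens_per_day / 1e6 * price_per_million * 365

def self_host_annual_cost(gpus: int, hourly_rate_per_gpu: float) -> float:
    """Annual cost of reserved cloud GPUs running around the clock."""
    return gpus * hourly_rate_per_gpu * 24 * 365

# Assumptions: ~$170/1M tokens blended API rate; H100 at ~$2.97/hr reserved
api = api_annual_cost(10e6, 170.0)
hosted = self_host_annual_cost(2, 2.97)
print(f"API: ${api:,.0f}/yr vs self-hosted: ${hosted:,.0f}/yr "
      f"({1 - hosted / api:.0%} cheaper)")
```

Rerun the same two functions with your own volume and negotiated rates before committing; the crossover moves quickly with token volume.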

The third reason is customization. Closed APIs don't let you fine-tune on your own data without handing that data over. Open weights do. You can train a coding assistant on your entire internal codebase, run LoRA adapters for different programming languages, and update the model as your codebase evolves. That level of control doesn't exist with any closed API.

Key Insight: The benchmark distance between open and closed models has shrunk from a canyon to a crack. At the end of 2023, the best closed model scored ~88% on MMLU while the best open model managed ~70.5% — a 17.5-point gap. By early 2026, that gap is effectively zero on knowledge benchmarks.

The model families dominating open source in 2026

[Figure: Open source LLM families by size and context window in 2026]

Llama 3.3 70B and Llama 4 (Meta)

Llama 3.3 70B, released December 2024, is the mature choice for teams wanting high performance in a well-understood package. It uses 70 billion parameters, a 128K-token context window, and was pretrained on 15 trillion tokens. Meta trained it on 39.3 million GPU hours of NVIDIA H100 compute.

The benchmark story is strong. Llama 3.3 scores 92.1 on IFEval (instruction following) — higher than both Llama 3.1 405B at 88.6 and GPT-4o at 84.6. On MMLU it reaches 86.0%, matching Amazon Nova Pro and edging out GPT-4o at 85.9%. On MATH (chain-of-thought), it scores 77.0, a significant jump from Llama 3.1 70B's 67.8. HumanEval reaches 88.4% pass@1. In short, it performs at close to Llama 3.1 405B quality in a fraction of the size, which is the whole pitch.

Hardware-wise, running Llama 3.3 70B in FP16 requires about 142 GB VRAM — two H100 80GB GPUs. In 4-bit quantization (Q4), you can squeeze it onto a single 40 GB A100 or two consumer RTX 4090s with tolerable quality loss.

Llama 4, released April 5, 2025, takes a different architectural approach with Mixture-of-Experts (MoE). Llama 4 Scout (109B total, 17B active, 10 million-token context) and Llama 4 Maverick (400B total, 17B active, 128 experts, 1 million-token context) are the shipping variants. Maverick scores 85.5 on MMLU and 86.4% on HumanEval. The extreme context windows are useful for whole-codebase understanding, which is relevant to the coding assistant use case.

One caveat: Llama 4 ships under the Llama Community License, not an OSI-approved open-source license. Organizations with 700 million or more monthly active users need separate Meta permission. The acceptable use policy also restricts certain applications. For most startups this is irrelevant, but legal teams should review it before building commercial products.

Mistral Small 3.2 and Mistral Large 3 (Mistral AI)

Mistral's lineup in 2026 covers two distinct sweet spots: an efficient small model and a frontier-class large model.

Mistral Small 3.2 (24B parameters, June 2025) is the efficiency champion and the current recommended version — it builds on 3.1 with improved instruction following (82.75% to 84.78% on internal accuracy), better function calling for production tool-use, and coding gains (HumanEval Plus jumped from 88.99% to 92.90%). It packs a 128K-token context window, vision capabilities, and 81%+ MMLU into a model that runs on a single RTX 4090 or fits on a 32GB MacBook with quantization. Inference speed is more than 3x faster than Llama 3.3 70B on equivalent hardware. If you're building on a single consumer GPU or deploying on edge hardware, Mistral Small 3.2 is the strongest option at this size tier.

Mistral Large 3 (675B total, 41B active via MoE, December 2025) plays in the frontier tier. With a 256K-token context window, it scores 73.11% on MMLU-Pro and 93.60% on MATH-500. The notable licensing milestone: Mistral Large 3 ships under Apache 2.0, marking a shift from Mistral's earlier restrictive custom licenses. For teams needing frontier-level quality with maximum licensing flexibility, this is a compelling option.

Both Mistral models speak multiple European languages well, reflecting the company's French origins and EU market focus. If multilingual European coverage matters for your use case, Mistral consistently outperforms equivalently-sized alternatives on French, German, Italian, and Spanish.

Gemma 3 (Google)

Google's Gemma 3 family, released March 2025, occupies the lightweight end of the spectrum. Sizes run from 270M to 27B parameters, with multimodal (vision + text) support at 4B, 12B, and 27B. The flagship Gemma 3 27B scored 42.4 on GPQA Diamond (graduate-level science) and 69.0 on MATH, beating Gemini 1.5 Pro on several benchmarks despite being fully self-hostable.

The 4B and 12B variants are the most practically interesting: Gemma 3 4B beats Gemma 2 27B, which means a model that runs on a laptop now outperforms what required a server a generation ago. Context windows are 128K for most sizes (32K for the 1B), and Gemma 3 supports 140+ languages.

The licensing caveat matters here. Gemma ships under Google's Terms of Use, not Apache 2.0 or MIT. Google retains the right to restrict usage, and the license conditions propagate to models fine-tuned on Gemma's synthetic outputs. For a production deployment where you'd fine-tune on proprietary code, this creates legal ambiguity worth discussing with counsel.

For experimentation, prototyping, and non-commercial research, Gemma 3 27B is excellent. For production products, the licensing uncertainty is a real consideration.

Common Pitfall: "Gemma is open source" is widely repeated, but technically incorrect. Gemma is open-weight under Google's Terms of Use, not an OSI-approved open-source license. This matters if you plan to fine-tune and distribute derivative models.

DeepSeek V3, V3.2, and R1 (DeepSeek AI)

DeepSeek's model family has evolved rapidly and now spans multiple generations.

DeepSeek V3 is a 671B total parameter MoE model with 37B parameters active per token, trained for ~$5.6 million. The March 2025 update (V3-0324) pushed MMLU-Pro from 75.9 to 81.2 and GPQA from 59.1 to 68.4 — significant gains. DeepSeek V3.2, released December 2025, introduced a sparse attention mechanism that reduces computational complexity from O(n²) to O(n) for long contexts, extending the context window to 163K tokens. V3.2 now matches GPT-5 and Claude Sonnet 4.5 across multiple benchmarks.

DeepSeek R1 (MIT license, January 2025) adds a chain-of-thought reasoning layer trained via Group Relative Policy Optimization (GRPO), a reinforcement learning technique that doesn't require a separate reward model. The original R1 scored 97.3 on MATH-500 and 79.8 on AIME 2024. The May 2025 update, R1-0528, made significant gains: AIME 2025 jumped from 70.0 to 87.5, GPQA from 71.5 to 81.0, and hallucination rates dropped by 45 to 50%. R1-0528 is now the second-highest AIME performer among open models, behind only OpenAI o3.

Both families support 128K-token context windows (V3.2 supports 163K). To run the full 671B models you need 8 H100s or equivalent; distilled variants (7B, 14B, 32B, 70B) offer the reasoning methodology in smaller packages that fit on a single GPU.

Qwen 2.5 and Qwen 3 (Alibaba)

Alibaba's Qwen family deserves more Western attention than it typically receives. Qwen 2.5, released September 2024, marked a step-change in multilingual capability and coding performance.

The Qwen 2.5 72B instruction-tuned model scores 86.0% on MMLU and 83.1 on MATH — competing directly with Llama 3.1 405B at a fraction of the size. Training data: 18 trillion tokens. Available sizes range from 0.5B through 72B, all under Apache 2.0, with specialized Qwen 2.5 Coder variants optimized for code.

Qwen 3, released April 28, 2025, introduced a hybrid reasoning mode: a single model can switch between fast "non-thinking" mode (standard generation) and slow "thinking" mode (extended chain-of-thought reasoning) via a simple parameter. The flagship Qwen3-235B-A22B (235B total, 22B active) outperforms DeepSeek R1 on 17 of 23 benchmarks despite smaller active parameter count.

Every Qwen model — from 0.6B to 235B — ships under Apache 2.0. No user limits, no acceptable use policy, no attribution requirements beyond the license notice. For a startup building a commercial product, this is the cleanest licensing position in the open model ecosystem.

For the coding assistant use case specifically: Qwen 2.5 Coder 32B scored higher than GPT-4o on several code generation benchmarks at the time of Qwen 3's release, and Qwen3-30B-A3B (a smaller MoE variant) runs at nearly 196 tokens/s on an RTX 4090. If multilingual code (Chinese, Japanese, Korean comments in codebases) matters, Qwen dominates this space.

New frontier entrants: Kimi K2 and GLM-5 (2026)

Two models released in mid-to-late 2025 and early 2026 pushed open-weight performance to new heights.

Kimi K2 (Moonshot AI, July 2025) is a 1-trillion-total-parameter MoE model with 32B active parameters, pretrained on 15.5 trillion tokens. It achieved 65.8% on SWE-bench Verified — at release, the strongest open-weight score on production software engineering tasks. The K2.5 variant, released January 27, 2026, pushed this to 76.8% SWE-bench, 87.6% GPQA Diamond, and 96.1% AIME 2025. For agentic coding at the frontier, K2.5 is among the open-weight leaders. The model ships under a modified MIT license.

GLM-5 (Zhipu AI, February 2026) scales from 355B parameters (32B active) to 744B parameters (40B active), pretrained on 28.5 trillion tokens. It achieves 77.8% on SWE-bench, 86.0% GPQA Diamond, and 90.0% HumanEval — making it competitive with the best closed models on agentic tasks. The 744B variant (covered in our China GLM-5 744B article) was notably trained on Huawei Ascend chips, demonstrating that frontier open-weight training no longer requires NVIDIA hardware.

Full model comparison table

| Model | Params (Active) | Context | MMLU | HumanEval | License | Best For |
|---|---|---|---|---|---|---|
| Gemma 3 4B | 4B | 128K | ~60 | ~50 | Google ToU | Edge, mobile |
| Phi-4 14B | 14B | 16K | 84.8 | ~82 | MIT | Reasoning on laptop |
| Mistral Small 3.2 | 24B | 128K | 81+ | 92.9 | Apache 2.0 | Efficient prod, edge |
| Gemma 3 27B | 27B | 128K | ~68 | ~58 | Google ToU | Research, prototyping |
| DeepSeek R1 Distill 32B | 32B | 128K | ~80 | ~88 | MIT | Single-GPU reasoning |
| Llama 3.3 70B | 70B | 128K | 86.0 | 88.4 | Llama Community | General instruction-following |
| Qwen 2.5 72B | 72B | 128K | 86.0 | 73.2 | Apache 2.0 | Multilingual, coding |
| Llama 4 Scout | 109B (17B) | 10M | ~79 | ~79 | Llama Community | Long-context tasks |
| Llama 4 Maverick | 400B (17B) | 1M | 85.5 | 86.4 | Llama Community | General + coding |
| Qwen3-235B | 235B (22B) | 128K | ~89 | ~90 | Apache 2.0 | Frontier quality + reasoning |
| DeepSeek R1-0528 | 671B (37B) | 128K | 90.8 | 90.2 | MIT | Reasoning, math, coding |
| DeepSeek V3.2 | 671B (37B) | 163K | 88.5+ | ~84 | Custom (commercial OK) | Long-context, general frontier |
| Kimi K2.5 | 1T (32B) | 128K | ~88 | 99.0 | Modified MIT | Agentic coding |
| Mistral Large 3 | 675B (41B) | 256K | 73.1 (MMLU-Pro) | ~85 | Apache 2.0 | Frontier reasoning |
| GLM-5 | 744B (40B) | 128K | ~87 | 90.0 | Apache 2.0 | Agentic tasks, coding |

Pro Tip: MMLU and HumanEval scores from different sources often reflect different evaluation setups (few-shot vs. zero-shot, pass@1 vs. pass@10). Use them for directional comparisons, not absolute rankings. Always run your own eval on your target task before committing.

VRAM requirements by GPU tier

Knowing which model fits your hardware is half the deployment decision. Here's a practical guide to what runs where in 2026:

[Figure: Open source LLM VRAM requirements by GPU tier]

| GPU Tier | VRAM | Models (FP16 / quantized) | Notes |
|---|---|---|---|
| RTX 4090 / 3090 | 24 GB | Up to ~12B (FP16); 32B (Q4); 70B (Q4 with CPU offload) | Qwen3-30B-A3B runs at 196 tok/s on an RTX 4090 |
| RTX 5090 | 32 GB | Up to ~14B (FP16); 70B (Q4 with light offload) | Best consumer card for the 70B class |
| MacBook Pro M3/M4 Max | 64 GB unified | Llama 3.3 70B (Q4); Mistral Small 3.2 (FP16) | Apple Silicon unified memory avoids the VRAM bottleneck |
| Single H100 80GB | 80 GB | Up to ~35B (FP16); 70B (Q8) | Recommended minimum for production 70B serving |
| 2× H100 | 160 GB | Llama 3.3 70B (FP16); DeepSeek R1 distills up to 70B | Standard production setup |
| 4× H100 | 320 GB | Llama 4 Scout (FP16); Qwen3-235B (Q8) | Mid-tier production cluster |
| 8× H100 | 640 GB | DeepSeek R1 671B (FP8, tight); Kimi K2.5 (Q4) | Frontier open-weight inference |

Rule of thumb: multiply parameter count by 2 for FP16 VRAM (70B = ~140 GB), divide by 4 for Q4 quantization. Add 10 to 15% for KV cache overhead. Qwen 3 MoE models are notably efficient — the 30B-A3B variant fits on a single 24GB card while activating only 3B parameters per token.
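That rule of thumb is easy to capture in a small helper. This is a rough sketch only; real usage varies with context length, batch size, and quantization scheme:

```python
def estimate_vram_gb(params_b: float, bits: int = 16,
                     kv_overhead: float = 0.15) -> float:
    """Rough VRAM estimate: weights plus KV-cache headroom.

    params_b: parameter count in billions.
    bits: weight precision (16 = FP16, 8 = Q8, 4 = Q4).
    kv_overhead: fractional headroom for the KV cache (~10-15%).
    """
    weights_gb = params_b * bits / 8  # bits -> bytes per parameter
    return weights_gb * (1 + kv_overhead)

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{estimate_vram_gb(70, bits):.0f} GB")
```

For MoE models, substitute total (not active) parameters: all expert weights must be resident even though only a fraction run per token.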

Choosing by use case

[Figure: Open source LLM selection guide by use case]

For a production coding assistant: DeepSeek R1-0528 or Kimi K2.5 are the top choices as of March 2026 for frontier coding quality. R1's GRPO-trained reasoning means it can explain architectural decisions and debug logic, not just complete code tokens. Kimi K2.5's 76.8% SWE-bench score is among the strongest open-weight results on real-world software engineering (GLM-5 edges it at 77.8%). Qwen 2.5 Coder 32B is the most practical single-GPU choice. The coding assistant use case also benefits from long context (to pass entire files) — Llama 4 Scout's 10M-token context window is useful for whole-repository understanding.

For RAG and chat applications: Mistral Small 3.2 or Llama 3.3 70B. RAG benefits from a large context window and strong instruction following — both score highly on IFEval. Mistral Small 3.2 is faster and more cost-efficient. If you're already building a Retrieval-Augmented Generation (RAG) pipeline, a 24B model running at 3x the speed of a 70B model means lower latency per request.

For multilingual applications: Qwen 2.5 72B or Qwen 3. The Qwen models are trained on 18 trillion tokens with heavy multilingual coverage and are the clear choice for East Asian languages (Chinese, Japanese, Korean). Gemma 3 also supports 140+ languages but with less depth in non-English tasks. Mistral is the strongest for European languages.

For on-device and edge deployment: Gemma 3 4B or Mistral Small 3.2 quantized. Gemma 3 4B runs in 4-bit quantization on Apple Silicon M-series chips at interactive speeds. Qwen3-30B-A3B, despite its 30B total parameters, activates only 3B per token and runs comfortably on a single RTX 4090 at 196 tokens/s. For truly constrained environments (mobile, embedded), look at Qwen 2.5 0.5B or 1.5B.

For fine-tuning on your own data: Qwen 2.5 (Apache 2.0, any size) or DeepSeek R1 distills (MIT). Clean licensing with permissive fine-tuning rights matters here. See Fine-Tuning LLMs with LoRA and QLoRA for implementation details — the techniques covered there apply directly to these models.

How to run open source models

Deploying an open model used to require ML infrastructure expertise. In 2026, the tooling has made it considerably more accessible.

[Figure: Open source LLM deployment options from local to production]

Ollama (local development)

Ollama is the easiest entry point. It handles model downloading, quantization selection, and serving in a single tool, and exposes an OpenAI-compatible API on localhost.

```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the DeepSeek R1 7B distill
ollama run deepseek-r1:7b

# Run Llama 3.3 70B (requires ~40 GB VRAM with Q4 quantization)
ollama run llama3.3:70b

# Run Mistral Small 3.2
ollama run mistral-small3.2
```

Ollama is ideal for development and testing. For production workloads, you'll hit its throughput ceiling quickly.
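Because Ollama speaks the OpenAI wire format on localhost:11434, your application code can talk to it with plain HTTP. A minimal sketch using only the standard library (assumes a local Ollama server with deepseek-r1:7b already pulled):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(payload: dict, base_url: str = "http://localhost:11434/v1") -> str:
    """POST to Ollama's OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_chat_request("deepseek-r1:7b", "Write a Python hello world.")
# print(chat(payload))  # uncomment with a local Ollama server running
```

The same payload works unchanged against any OpenAI-compatible server (vLLM, SGLang), which makes the later migration path mostly a base-URL swap.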

llama.cpp (inference on CPU/GPU)

Llama.cpp is the most portable inference engine. Written in C++, it runs on CPUs (using AVX2/AVX-512 for acceleration), CUDA GPUs, Metal (Apple Silicon), and ROCm (AMD). It's the backbone of many production deployments on modest hardware.

```bash
# Clone and build with CUDA support
# (llama.cpp migrated from the old Makefile to CMake;
#  the CUDA flag is GGML_CUDA in current versions)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Download a GGUF model from Hugging Face
# (e.g., Qwen2.5-72B-Instruct-Q4_K_M.gguf)
./build/bin/llama-cli -m models/Qwen2.5-72B-Instruct-Q4_K_M.gguf \
  -n 512 \
  -p "Implement a binary search tree in Python"
```
GGUF quantization levels (Q4_K_M, Q5_K_M, Q8_0) offer tradeoffs between model size, VRAM usage, and output quality. Q4_K_M cuts memory roughly 4x vs FP16 with about 2 to 5% quality degradation on most tasks.

Hugging Face Inference (managed, scalable)

For teams that want open model quality without infrastructure management, Hugging Face Inference Endpoints provide serverless and dedicated GPU endpoints for any model in the Hub:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    token="hf_..."
)

response = client.text_generation(
    prompt="<s>[INST] Review this Python function for bugs:\n\n```python\ndef binary_search(arr, target):\n    left, right = 0, len(arr)\n    while left < right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid\n    return -1\n```\n[/INST]",
    max_new_tokens=512
)
```

Costs run roughly $0.60 to $1.20 per million tokens for 24B-class models — significantly cheaper than GPT-4-class APIs while maintaining data isolation.

vLLM and SGLang (high-throughput production)

For 2026 production deployments, you have two strong options: vLLM and SGLang. They've taken different architectural bets.

vLLM uses PagedAttention to manage KV cache memory as dynamically allocated blocks. It's the more mature ecosystem (3x to 5x higher throughput than naive HuggingFace Transformers serving), has the widest model compatibility, and exposes OpenAI-compatible endpoints.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,
    max_model_len=32768
)

sampling_params = SamplingParams(temperature=0.6, max_tokens=2048)

prompts = [
    "Explain the Big O complexity of quicksort with a partition analysis.",
    "Write a Redis-backed rate limiter in Python with sliding window semantics."
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

SGLang is the faster option for multi-turn workloads. On H100 GPUs, SGLang achieves 16,215 tokens/s versus vLLM's 12,553 — a 29% throughput advantage. Its RadixAttention mechanism caches and reuses KV states across turns, which is highly effective for coding assistants (where system prompts and conversation history repeat across requests). For multi-turn conversations with shared context, SGLang shows a 10 to 20% additional speed boost over vLLM.

```python
import sglang as sgl

# Point the frontend language at a running SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def code_review(s, function_code):
    s += sgl.system("You are an expert Python code reviewer.")
    s += sgl.user(f"Review this function:\n\n```python\n{function_code}\n```")
    s += sgl.assistant(sgl.gen("review", max_tokens=512))

state = code_review.run(function_code="def fib(n): return fib(n-1) + fib(n-2)")
print(state["review"])
```

Pro Tip: For a coding assistant startup, the typical path is: Ollama for development, Hugging Face Inference for early production, vLLM on dedicated GPUs for batch workloads, SGLang when you're optimizing multi-turn latency and need KV cache reuse. The 29% throughput gap of SGLang over vLLM translates to roughly $15,000 in monthly GPU savings at one million requests per day.

License comparison: what "open" actually means

License terms matter more than most teams realize until they have a legal review.

| License | Models | Commercial Use | Redistribute | Fine-tune + Distribute | User Limits |
|---|---|---|---|---|---|
| MIT | DeepSeek R1, Phi-4, Kimi K2 | Yes | Yes | Yes | None |
| Apache 2.0 | Qwen 2.5/3, Mistral Small 3.2/Large 3, GLM-5 | Yes | Yes | Yes | None |
| Llama Community | Llama 3.3, Llama 4 | Yes (with attribution) | Restricted | Yes, with restrictions | 700M MAU limit |
| Google ToU | Gemma 3 | Restricted | No | Conditions propagate | Revocable |

MIT and Apache 2.0 are equivalent for most commercial purposes. The key difference: Apache 2.0 includes an explicit patent grant, which matters in enterprise contexts. MIT does not.

The Llama Community License is permissive enough for most startups — you can build and sell a product. The 700 million monthly active user ceiling is irrelevant until you're larger than Facebook. The acceptable use policy restricts weapons development and child safety violations, which are not relevant to typical business applications.

Gemma's Google Terms of Use is the licensing position to be most careful about. If you fine-tune Gemma on proprietary data and distribute the resulting model (even internally), Google's terms create ambiguity about whether derivative models inherit the restrictions.

Key Insight: "Open source" and "open weights" are not the same thing. The Open Source Initiative published the Open Source AI Definition in October 2024: truly open source AI requires training data, code, and weights — all permissively licensed. By this definition, none of the frontier open-weight models (Llama, DeepSeek, Qwen, Mistral) qualify as open source. They are "open weights" with varying degrees of licensing freedom.

When proprietary APIs are still the right choice

Open models win on cost, privacy, and customizability. Closed APIs still win on specific dimensions that matter in some applications.

Peak agentic coding: On SWE-bench Verified, Claude Sonnet 4.5 scores 77.2%, with Kimi K2.5 at 76.8% and GLM-5 at 77.8% — the raw benchmark gap has effectively closed. That said, proprietary APIs layer safety filtering, tool-use reliability, and guaranteed uptime on top. For a coding agent where a single bad output could break production, the managed reliability of a closed API is worth something.

Zero-ops deployment: OpenAI, Anthropic, and Google handle reliability, uptime SLAs, model updates, and safety filtering. For a team without ML infrastructure, starting on closed APIs and migrating to open models later is a reasonable path.

When you process fewer than 5 million tokens per month: At low volumes, API simplicity beats self-hosting economics. The crossover point in 2026 is roughly 5 to 10 million tokens per month, depending on the model tier and your team's infrastructure comfort.
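That crossover point is one line of arithmetic. The fixed hosting cost and API rate below are illustrative assumptions, not quotes; the point is to solve for the volume where the curves cross:

```python
def breakeven_tokens_per_month(monthly_gpu_cost: float,
                               api_price_per_million: float) -> float:
    """Monthly token volume above which self-hosting beats the API.

    Ignores engineering time and ops overhead, which in practice
    push the real break-even point higher.
    """
    return monthly_gpu_cost / api_price_per_million * 1e6

# e.g. ~$500/mo of GPU capacity vs. an API at ~$60/1M tokens (assumptions)
print(f"{breakeven_tokens_per_month(500, 60):,.0f} tokens/month")
```

With these inputs the break-even lands around 8M tokens/month, inside the 5 to 10 million range quoted above; plug in your own numbers.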

For the coding assistant startup, the practical path is: prototype with Mistral Small 3.2 via Hugging Face Inference (fast, no ops), validate product-market fit, then migrate to self-hosted Qwen 2.5 Coder 32B or DeepSeek R1 distill on dedicated GPUs once you have predictable volume. The LLM quantization guide covers how to run these models on consumer hardware during the prototype phase.

Conclusion

The open source LLM field in March 2026 is the most competitive it has ever been. DeepSeek's iterative releases — V3, V3-0324, R1, R1-0528, V3.2 — demonstrated that a well-engineered MoE architecture with careful RL post-training can match and occasionally beat closed frontier models, and the model series keeps improving every quarter. Qwen 3 and Mistral Large 3 showed that Apache 2.0 licensing and frontier performance can coexist. Kimi K2.5 and GLM-5, released in early 2026, proved that the frontier of open-weight performance is no longer a DeepSeek monopoly.

For the startup building a coding assistant, the decision matrix is straightforward: use Qwen 2.5 Coder or DeepSeek R1 distills for single-GPU development and early production, migrate to Kimi K2.5 or Qwen3-235B on multi-GPU infrastructure for the best agentic quality, and choose Apache 2.0 licensed models (Qwen, Mistral) if clean commercial licensing matters to your legal team. Watch SWE-bench as the coding-specific benchmark that matters most for this use case.

If you're evaluating whether to fine-tune these models on your own data, see Fine-Tuning LLMs with LoRA and QLoRA — QLoRA makes fine-tuning a 70B model feasible on a single A100. And if you're building a retrieval layer on top of your chosen model, RAG: Making LLMs Smarter with Your Data covers the architecture decisions that matter most. For a broader view of where LLMs are heading, How Large Language Models Actually Work explains the transformer mechanics that underpin every model in this guide.

The gap between "best open" and "best closed" is narrowing every quarter. For most production applications that don't require absolute frontier capability, open models in 2026 are ready.

Interview Questions

What is the difference between open-source, open-weight, and closed LLMs?

Open-source LLMs (by the OSI definition) require training data, code, and weights all under permissive licenses — almost no frontier models qualify. Open-weight models release trained weights but not training data or code; Llama, DeepSeek, and Qwen fall here. Closed models release nothing — only API access. Most "open source AI" discussion refers to open-weight models with varying license terms.

Why does MoE architecture matter for open-weight models?

Mixture-of-Experts (MoE) architecture uses a routing layer to activate only a fraction of parameters per token. DeepSeek R1 has 671B total parameters but activates 37B per token; Kimi K2 has 1T total but activates only 32B. This means inference cost scales with active parameters, not total parameters, enabling frontier-quality knowledge capacity at manageable inference cost. The tradeoff is that all expert weights still need to fit in memory.

Your startup is building a coding assistant with proprietary client code. Which open model and deployment setup would you recommend?

Start with Qwen 2.5 Coder 32B or a DeepSeek R1 7B distill on Ollama for development — no data leaves your machine. For production, self-host Qwen 2.5 Coder 32B or an R1 distill on 1 to 2 H100s using vLLM or SGLang; for the best agentic coding quality, serve the full DeepSeek R1-0528 or Kimi K2.5 on an 8-GPU node. Qwen's Apache 2.0 license is the cleanest for commercial use; R1's MIT license is equally permissive. Avoid Gemma for fine-tuning scenarios due to licensing ambiguity.

How do you evaluate an open-weight LLM for a specific production task rather than relying on public benchmarks?

Public benchmarks (MMLU, HumanEval) measure general capability in controlled conditions that often don't match production distributions. Build a task-specific evaluation set from real examples in your domain — 100 to 500 labeled inputs and desired outputs. Run candidate models, score outputs with an automated judge or human review, and report pass rate and latency together. The model that wins on your eval often differs from the MMLU leaderboard winner.
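A minimal version of that loop — the toy "model" and exact-match checker are placeholders for your own inference call and judge:

```python
import time

def evaluate(model_fn, dataset, check) -> dict:
    """Score a model on a task-specific eval set.

    dataset: list of (input, expected) pairs.
    check(output, expected) -> bool decides a pass.
    Reports pass rate and median latency together, as argued above.
    """
    passes, latencies = 0, []
    for prompt, expected in dataset:
        t0 = time.perf_counter()
        output = model_fn(prompt)
        latencies.append(time.perf_counter() - t0)
        passes += check(output, expected)
    return {
        "pass_rate": passes / len(dataset),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
    }

# Toy example: a "model" that upper-cases, judged by exact match
dataset = [("abc", "ABC"), ("def", "DEF"), ("ghi", "ghj")]
report = evaluate(str.upper, dataset, lambda out, exp: out == exp)
print(report)
```

Swap `str.upper` for a call into your serving stack and the exact-match lambda for an automated judge, then run the identical harness over every candidate model.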

What is the practical difference between Q4 and Q8 quantization for a 70B model?

Q4 quantization reduces each weight from 16-bit to 4-bit, cutting VRAM from ~140 GB to ~35 GB. Q8 uses 8-bit, requiring ~70 GB but preserving more precision. Quality difference: Q8 is nearly indistinguishable from FP16 on most tasks. Q4 shows 2 to 5% degradation on math, coding, and knowledge-intensive tasks. For a coding assistant, Q8 is worth the extra VRAM if you have it; Q4 is acceptable for prototyping on constrained hardware.

Why did DeepSeek R1 rely on GRPO rather than standard RLHF for training?

Standard RLHF requires training a separate reward model on human preference data, which is expensive and slow. Group Relative Policy Optimization (GRPO) eliminates the reward model by sampling multiple responses per prompt and using within-group score differences as the reward signal. This cuts RL training cost by roughly 50% and avoids reward model overfitting. The approach proved effective enough to achieve o1-parity on MATH-500.

When would you recommend SGLang over vLLM for production serving?

SGLang is the better choice for multi-turn workloads like coding assistants or chatbots where conversation history and system prompts repeat across requests. Its RadixAttention caches and reuses KV states, delivering 10 to 20% lower latency on multi-turn conversations and 29% higher throughput on H100s compared to vLLM overall. vLLM remains preferable for batch single-turn inference (document processing, content generation) and when you need the widest ecosystem compatibility or simplest setup.

When would you recommend a closed API over a self-hosted open model in 2026?

Three main scenarios: when your team lacks ML infrastructure expertise and can't absorb the ops overhead; when you need the absolute frontier of autonomous coding capability (even with Kimi K2.5 approaching Claude Sonnet 4.5 on SWE-bench, the managed reliability and safety filtering of closed APIs adds value for production agents); and when your token volume is below 5 million per month, making API simplicity cheaper than self-hosting overhead. Above that volume, with a capable infrastructure team, self-hosted open models win on total cost.
