Open Source vs Closed LLMs: Choosing the Right Model in 2026

LDS Team
Let's Data Science
9 min read

In January 2025, a Chinese AI lab released a reasoning model under an MIT license that matched OpenAI's o1 on most benchmarks. It cost $5.9 million to train. Nvidia lost $589 billion in market value in a single day. The "open vs closed" debate was no longer theoretical — it had real numbers behind it, and those numbers were shocking. If you're building with LLMs in 2026, the choice between open and closed models is the most consequential architectural decision you'll make. This guide gives you the facts, the economics, and the framework to make it well.

The performance gap that disappeared

Two years ago, picking an open model meant accepting significantly worse quality. The MMLU benchmark — a widely used measure of broad knowledge — told a clear story: at the end of 2023, the best closed model scored around 88% while the best open model managed roughly 70.5%, a gap of 17.5 percentage points.

Then 2024 happened.

By December 2024, DeepSeek V3 scored 88.5 on MMLU — higher than GPT-4o's 87.2. The gap had collapsed from 17.5 points to effectively zero in a single year. The Stanford AI Index 2025 Report confirmed this convergence: across multiple benchmarks, the performance gap between the best open and closed models had narrowed dramatically, with open models matching or exceeding closed models on several key evaluations.

This wasn't a fluke from one model. The convergence happened across the board:

| Benchmark | Best Open Model (Early 2026) | Score | Best Closed Model | Score |
|---|---|---|---|---|
| MMLU | DeepSeek V3 | 88.5 | GPT-4o | 87.2 |
| MATH-500 | DeepSeek R1 | 97.3 | OpenAI o1 | ~96.0 |
| AIME 2024 | Qwen3-235B (Thinking) | 85.7 | OpenAI o3 | 96.7 |
| SWE-bench Verified | Kimi K2 | 65.8% | Claude Sonnet 4.5 | 77.2% |
| MMLU-Pro | Mistral Large 3 | 73.1% | Gemini 3 Pro | 89.8% |
| LMArena Elo | DeepSeek V3.2 | ~1460 | Gemini 3 Pro | 1501 |

The pattern is clear: open models now match or beat closed models on knowledge and math, while closed models maintain a lead on production coding (SWE-bench), complex instruction following, and composite reasoning benchmarks. That remaining gap is narrowing, but it's real.

The open model landscape in 2026

The open model ecosystem in 2026 looks nothing like it did even 18 months ago. Five model families dominate, each with distinct strengths.

DeepSeek: the cost-efficiency pioneer

DeepSeek R1, released January 20, 2025, remains the watershed moment for open models. Built on the DeepSeek V3 base (671B total parameters, 37B active per token via Mixture-of-Experts), DeepSeek R1 achieved frontier-level reasoning through pure reinforcement learning. On MATH-500, DeepSeek R1 scored 97.3 — beating OpenAI's o1. On MMLU, it reached 90.8 (vs o1's 91.8).

The training cost made headlines: the full pipeline cost approximately $5.9 million ($5.6M for the V3 base model, ~$294K for the RL phase). Key innovations included FP8 mixed-precision training at scale, Multi-Head Latent Attention (MLA) reducing KV cache by 93.3%, and Group Relative Policy Optimization (GRPO) cutting RL costs by roughly 50%.

DeepSeek continued iterating throughout 2025. DeepSeek-V3-0324 (March 2025) outperformed GPT-4.5 on math and coding. DeepSeek-R1-0528 (May 2025) became the second-highest model on AIME, behind only OpenAI's o3. By late 2025, DeepSeek V3.2-Speciale reached Gemini-3.0-Pro-level performance and won gold medals at the 2025 International Mathematical Olympiad (IMO) and International Olympiad in Informatics (IOI).

DeepSeek R1 ships under an MIT license — fully permissive, no restrictions.

Qwen 3: the Apache 2.0 powerhouse

Alibaba's Qwen 3 family, released April 28, 2025, brought a critical innovation: hybrid thinking modes. A single model can toggle between fast "non-thinking" mode and slow "thinking" mode (similar to OpenAI's o1-style chain-of-thought reasoning), controlled by a simple parameter.
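As a sketch of how that toggle looks in practice, the snippet below uses the enable_thinking flag that Qwen documents for its Hugging Face chat template. The model ID and generation settings are illustrative (the small 0.6B checkpoint is used so the example runs on modest hardware), and the flag name should be verified against the current model card.

```python
# Minimal sketch of Qwen3's hybrid thinking toggle via Hugging Face transformers.
# Model ID and settings are illustrative; all Qwen3 sizes share the same chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain MoE routing in two sentences."}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # True = slow chain-of-thought mode, False = fast direct answers
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(output[0], skip_special_tokens=True))
```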

The flagship Qwen3-235B-A22B uses a MoE architecture (235B total, 22B active) trained on 36 trillion tokens across 119 languages. On thinking-mode benchmarks, Qwen3-235B outperforms DeepSeek R1 on 17 of 23 benchmarks despite having only 35% of R1's total parameters and 60% of its active parameters.

Every Qwen3 model — from the 0.6B to the 235B — ships under Apache 2.0. This is significant. Apache 2.0 allows unrestricted commercial use, modification, and redistribution with no monthly active user limits, no acceptable use policies, and no "phone home" requirements.

Llama 4: scale meets restriction

Meta's Llama 4 family (April 5, 2025) pushed the boundaries of context length and scale. Llama 4 Scout offers a 10-million-token context window with 109B total parameters (17B active), while Maverick uses 400B total parameters (17B active, 128 experts) with a 1-million-token context window. The still-in-training Behemoth targets approximately 2 trillion total parameters.

Llama 4 Maverick scores 85.5 on MMLU and 86.4% on HumanEval. Meta claims Behemoth outperforms GPT-4.5 and Claude Sonnet 3.7 on STEM benchmarks, with preliminary MATH-500 scores around 95.0.

However, Llama 4 ships under the Llama Community License — not Apache 2.0, not MIT, and not OSI-approved open source. The license includes a hard restriction: any organization with 700 million or more monthly active users must obtain separate permission from Meta. The license also requires "Built with Llama" attribution and includes an acceptable use policy.

Mistral Large 3: the European contender

Mistral Large 3, released December 2, 2025, is a 675B total parameter MoE model (41B active) with a 256K-token context window. Mistral Large 3 scores 73.11% on MMLU-Pro and 93.60% on MATH-500, placing it among the top open models on reasoning benchmarks.

The significant development here is licensing: Mistral Large 3 ships under Apache 2.0. Mistral previously used restrictive custom licenses for its larger models, making this a milestone for the open ecosystem.

The rest of the field

Microsoft Phi-4 (January 2025, 14B parameters, MIT license): Phi-4-reasoning-plus approaches full DeepSeek R1 performance on AIME 2025 — a 14B model rivaling a 671B model. This makes Phi-4 the poster child for the "small language model revolution" that MIT Technology Review named a Breakthrough Technology of 2025.

Google Gemma 3 (March 2025, up to 27B parameters): Gemma-3-27B beats the original Gemini 1.5-Pro across benchmarks. However, Gemma ships under Google's Terms of Use — not a permissive open-source license. Google reserves the right to remotely restrict usage, and license conditions propagate to models trained on Gemma's synthetic outputs.

Kimi K2 (Moonshot AI, 2025, 1 trillion total parameters, 32B active): K2 scores 65.8% on SWE-bench Verified, placing it among the top models for agentic coding tasks.

The closed model landscape in 2026

Closed models in 2026 compete primarily on three axes: peak capability, ease of integration, and specialized features.

| Model | Provider | Release | Context | Input $/MTok | Output $/MTok |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Feb 2026 | 200K (1M beta) | $5.00 | $25.00 |
| GPT-5.2 | OpenAI | Dec 2025 | 400K | $1.75 | $14.00 |
| GPT-5 | OpenAI | Aug 2025 | 400K | $1.25 | $10.00 |
| Gemini 3 Pro | Google | Nov 2025 | 1M | $2.00 | $12.00 |
| Claude Sonnet 4.5 | Anthropic | Sep 2025 | 200K (1M beta) | $3.00 | $15.00 |
| o3 | OpenAI | Apr 2025 | 200K | $2.00 | $8.00 |
| GPT-4.1 | OpenAI | Apr 2025 | 1M | $2.00 | $8.00 |

For cost-conscious applications, budget-tier closed models are remarkably cheap:

| Model | Provider | Input $/MTok | Output $/MTok |
|---|---|---|---|
| GPT-4.1 nano | OpenAI | $0.10 | $0.40 |
| GPT-4.1 mini | OpenAI | $0.40 | $1.60 |
| Gemini 2.5 Flash | Google | $0.50 | $3.00 |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 |
| o4-mini | OpenAI | $1.10 | $4.40 |

All major providers now offer prompt caching with 50-90% discounts on repeated input tokens. Google and Anthropic lead with 90% cached input discounts across their model lines.
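As an illustration of how that caching is used, here is a minimal sketch in the style of Anthropic's prompt-caching API, where a large, stable system prompt is marked cacheable so repeat requests bill at the discounted cached-input rate. The field names and model alias follow Anthropic's published documentation at the time of writing and should be checked against the current API reference.

```python
# Sketch: mark a large, reused system prompt as cacheable so repeated calls
# pay the discounted cached-input rate. Model alias and file path are placeholders.
import anthropic

client = anthropic.Anthropic()
LONG_POLICY_DOC = open("support_policies.txt").read()  # hypothetical reused context

resp = client.messages.create(
    model="claude-sonnet-4-5",  # verify the current model alias in Anthropic's docs
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_POLICY_DOC,
        "cache_control": {"type": "ephemeral"},  # cache this block across requests
    }],
    messages=[{"role": "user", "content": "Does the policy cover refunds after 30 days?"}],
)
print(resp.content[0].text)
```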

Key Insight: LLM inference costs have dropped approximately 10x annually. GPT-4-equivalent performance cost roughly $20 per million tokens in late 2022 and approximately $0.40 per million tokens in early 2026.

The licensing spectrum: "open" doesn't mean what you think

One of the most misunderstood aspects of the open vs closed debate is what "open" actually means. The Open Source Initiative (OSI) published the Open Source AI Definition (OSAID) 1.0 in October 2024, establishing clear criteria: an open-source AI model must provide sufficiently detailed data information, complete training code, and model weights — and must allow use for any purpose, study, modification, and sharing without permission.

By this definition, almost none of the popular "open" models qualify. DeepSeek R1, Llama 4, Qwen 3, and Mistral Large 3 all release weights but not training data. The models that do meet OSI's standard — Pythia (EleutherAI), OLMo (AI2), Amber and CrystalCoder (LLM360), and T5 (Google) — are not the ones dominating benchmark leaderboards.

What most people call "open-source LLMs" are more accurately called "open-weight models." The practical distinction matters because the license attached to those weights determines what you can actually do:

| License Tier | License | Models | Key Restrictions |
|---|---|---|---|
| Fully permissive | MIT | DeepSeek R1, Phi-4 | None |
| Fully permissive | Apache 2.0 | Qwen 3, Mistral Large 3 | None |
| Conditionally permissive | Llama Community | Llama 4, Llama 3.x | 700M MAU limit, attribution required, acceptable use policy |
| Restrictive | Gemma ToU | Gemma 3 | Google can remotely restrict usage, conditions propagate to derivative models |
| Non-commercial | CC-BY-NC | Command R+ | No commercial use |

For enterprise deployments, this spectrum matters enormously. A company building a product on Llama 4 faces different legal exposure than one building on Qwen 3 — even though both are commonly called "open source."

The economics: when self-hosting beats API calls

The decision to self-host an open model vs calling a closed API is fundamentally an economic calculation. The math has shifted dramatically in favor of self-hosting at scale.

API costs at different volumes

Consider a production application processing 10 million tokens per day (roughly 7.5 million words — a mid-size customer support chatbot or document processing pipeline):

| Approach | Monthly Cost | Annual Cost |
|---|---|---|
| GPT-5 API (50/50 input/output split) | ~$51,750 | ~$621,000 |
| Claude Sonnet 4.5 API | ~$41,400 | ~$496,800 |
| Gemini 3 Pro API | ~$32,200 | ~$386,400 |
| Llama 3.3 70B on Groq (managed) | ~$10,062 | ~$120,744 |
| Llama 3.3 70B self-hosted (2x H100) | ~$4,320 | ~$51,840 |

Pro Tip: The breakeven point between premium closed APIs and self-hosted open models is approximately 5-10 million tokens per month. Below that volume, APIs are simpler and cheaper. Above it, self-hosting saves 50-90% annually.
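To run this comparison against your own volumes and contract prices rather than illustrative figures, a small estimator like the sketch below is enough. The parameters are placeholders for your actual quotes, and the self-hosting function deliberately counts only GPU rental; staff and operations overhead (covered below) change the breakeven substantially.

```python
# Back-of-the-envelope cost estimator; plug in your own prices, split, and GPU rates.
def api_monthly_cost(tokens_per_month: float, input_price: float,
                     output_price: float, output_share: float = 0.5) -> float:
    """API bill in USD, given $/MTok prices and the share of tokens that are output."""
    mtok = tokens_per_month / 1e6
    return mtok * ((1 - output_share) * input_price + output_share * output_price)

def self_hosted_monthly_cost(gpus: int, usd_per_gpu_hour: float,
                             hours_per_month: float = 730) -> float:
    """Raw GPU rental only; staff, monitoring, and redundancy are extra."""
    return gpus * usd_per_gpu_hour * hours_per_month

def breakeven_tokens_per_month(hosting_monthly_cost: float, input_price: float,
                               output_price: float, output_share: float = 0.5) -> float:
    """Monthly token volume above which self-hosting beats the API on raw cost."""
    blended = (1 - output_share) * input_price + output_share * output_price  # $/MTok
    return hosting_monthly_cost / blended * 1e6
```

Because the result is dominated by the blended output price, your input/output split, and your true hosting total cost of ownership, it is worth recomputing with your own numbers rather than list prices.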

Hardware requirements and costs

Running open models requires understanding VRAM constraints:

| Model Size | FP16 VRAM | INT4 (Q4) VRAM | Minimum Hardware |
|---|---|---|---|
| 7-8B | ~14 GB | ~4-5 GB | 1x RTX 4090 (24 GB) |
| 14B | ~28 GB | ~8 GB | 1x RTX 4090 or A100 |
| 70B | ~140 GB | ~35 GB | 2x RTX 5090 or 1x H100 |
| 405B | ~972 GB | ~243 GB | 8x H100 (640 GB) with 4-bit quant |
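The rule of thumb behind these figures is parameter count times bytes per parameter. The sketch below encodes that weights-only estimate; actual serving needs extra headroom for the KV cache and activations, which is why the minimum-hardware column includes a margin.

```python
# Weights-only VRAM estimate: parameter count (in billions) x bytes per parameter.
# Serving frameworks need additional headroom for KV cache and activations.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    return params_billions * BYTES_PER_PARAM[precision]

print(weights_vram_gb(7))            # 14.0 GB -> matches the 7-8B FP16 row
print(weights_vram_gb(70, "int4"))   # 35.0 GB -> matches the 70B Q4 row
```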

Cloud GPU rental prices in early 2026: H100 80GB runs $1.49-3.90/hour depending on provider (Vast.ai at $1.49, RunPod at $1.99-2.69, Lambda at $2.99, AWS at $3.90). A100 80GB has dropped to $0.66-0.78/hour on spot markets.

The consumer GPU story is compelling. Dual RTX 5090s (~$4,000 total hardware cost) now match single-H100 throughput for 70B models at 25% of the rental cost. The RTX 5090 pushes 213 tokens per second on 8B models and handles quantized 70B models at interactive speeds.

The hidden costs of self-hosting

Self-hosting isn't just GPU rental. Realistic total cost of ownership includes:

  • Infrastructure staff: MLOps engineers to manage deployment, monitoring, and updates
  • Inference optimization: Choosing and tuning serving frameworks (vLLM, SGLang, TensorRT-LLM)
  • Quantization trade-offs: 4-bit quantization (Q4_K_M) retains approximately 92% of original quality; AWQ retains approximately 95%. This quality loss may or may not matter for your use case
  • Uptime and redundancy: No SLA unless you build one yourself
  • Model updates: When a better model releases, you handle the migration

A realistic minimal deployment costs $125,000-190,000 annually when factoring in staff time, infrastructure, and operations — not just raw GPU hours.
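As a reference point for the serving-framework item above, a minimal vLLM deployment for offline batch inference looks roughly like the sketch below. The model ID, GPU split, and sampling settings are illustrative, not a recommended production configuration, and a real service would add an API server, monitoring, and autoscaling.

```python
# Minimal vLLM offline-inference sketch. Model ID, tensor-parallel degree, and
# sampling settings are illustrative; match them to your hardware and workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumes access and ~2x 80 GB GPUs
    tensor_parallel_size=2,                      # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,                 # let vLLM use up to 90% of each GPU
)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize our refund policy in one paragraph."], params)
print(outputs[0].outputs[0].text)
```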

Where closed models still win

Despite the benchmark convergence, closed models maintain meaningful advantages in several areas.

Advanced reasoning at the frontier

OpenAI's o3 still leads on the hardest reasoning tasks:

| Benchmark | o3 | DeepSeek R1 | Gap |
|---|---|---|---|
| AIME 2024 (math competition) | 96.7% | 79.8% | +16.9 pts |
| GPQA Diamond (graduate reasoning) | 87.7% | 71.5% | +16.2 pts |
| Codeforces (competitive coding) | 2727 Elo | 2029 Elo | +698 pts |
| SWE-bench Verified (real bugs) | 71.7% | 49.2% | +22.5 pts |

These gaps are large. For applications that require peak reasoning — complex code generation, multi-step mathematical proofs, or graduate-level scientific reasoning — closed frontier models like o3 and Gemini 3 Pro still outperform their open counterparts by a significant margin.

Latency

In benchmarks, o3 completes complex reasoning tasks in approximately 27 seconds versus approximately 1 minute 45 seconds for DeepSeek R1 on comparable tasks. Closed API providers invest heavily in inference optimization that individual teams cannot replicate.

Managed infrastructure

No deployment to manage, no VRAM to calculate, no models to update. For teams without MLOps expertise, the simplicity of an API call is worth the premium. You also get automatic safety patches — when a vulnerability is discovered, the provider fixes it server-side without any action from you.

Instruction following

On IFEval (instruction-following evaluation), GPT-4.1 scores 87.4% — meaningfully ahead of open alternatives. For applications where precise adherence to complex multi-constraint instructions matters (structured data extraction, multi-format output generation), closed models still have an edge.

Where open models win

Data sovereignty and compliance

For healthcare organizations subject to HIPAA, financial institutions under FINRA, any company operating under GDPR (with fines up to 4% of global annual turnover), or organizations subject to the EU AI Act (with penalties up to 7% of global annual turnover), running models on your own infrastructure isn't a preference — it's often a regulatory requirement. Closed model APIs may use interactions to train future models, and LLMs have demonstrated the ability to memorize and reproduce verbatim training examples under targeted querying.

Fine-tuning on proprietary data

Closed APIs offer limited fine-tuning through parameter-efficient methods, but true weight access is exclusive to open models. The fine-tuning ecosystem in 2026 is mature:

  • LoRA fine-tuning recovers 90-95% of full fine-tuning quality while training only 0.1-1% of parameters
  • QLoRA enables fine-tuning a 7B model on a single RTX 4090 (24 GB VRAM) or a 70B model on a single A100 (80 GB)
  • Cost: Fine-tuning Mistral 7B with LoRA on 8,000 annotated documents takes approximately 16 hours on a single A100, costing roughly $120
  • Tools: Unsloth delivers 2x faster fine-tuning with 70% less VRAM; Axolotl provides YAML-driven configuration for the full pipeline

For domain-specific applications — legal document analysis, medical coding, financial compliance — a fine-tuned 7B open model often outperforms a general-purpose frontier model while running on a single consumer GPU.
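A minimal sketch of that workflow with the Hugging Face peft library is shown below. The base model, dataset path, and hyperparameters are placeholders; a real run would add evaluation, prompt formatting, and sequence packing.

```python
# Minimal LoRA fine-tuning sketch with transformers + peft.
# Model ID, dataset file, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.3"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA: train small low-rank adapter matrices instead of the full weight matrices.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters

ds = load_dataset("json", data_files="annotated_docs.jsonl")["train"]  # hypothetical corpus
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # labels from input_ids
)
trainer.train()
model.save_pretrained("lora-out")  # writes only the small adapter weights, not the full model
```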

No vendor lock-in

OpenAI retired 33 models in January 2025 alone. When GPT-5 launched, its changed model routing broke production workflows overnight — companies with hundreds of prompt-based automations had to debug broken integrations and re-engineer prompt logic under pressure. OpenAI is planning to retire GPT-4o, GPT-4.1, and o4-mini, pushing developers toward GPT-5.x.

With open models, you control the version. You can freeze a model that works and run it indefinitely. No deprecation notices, no forced migrations, no breaking changes to your prompt stack.

Managed open model hosting: the middle ground

You don't have to self-host to use open models. Managed inference providers offer open models via API at significantly lower cost than closed alternatives:

| Provider | Llama 3.3 70B (input/output $/MTok) | Key Feature |
|---|---|---|
| Groq | $0.59 / $0.79 | LPU hardware, 300 tok/s (10x H100 speed) |
| Cerebras | $0.60 / $0.60 | WSE-3 chip, 2,100 tok/s on 70B models |
| Together AI | $0.88 (blended) | GPU clusters, DeepSeek R1 at $0.55/$2.19 |
| DeepInfra | $0.36 (blended) | Lowest cost for 70B |

Compare Groq's $0.59/$0.79 for Llama 3.3 70B against GPT-4o's $2.50/$10.00 — that's roughly a 4-13x cost reduction while maintaining competitive quality.

A decision framework

The open vs closed choice isn't binary. Most production systems in 2026 use a hybrid approach. Here's a practical framework:

Choose closed APIs when:

  • You need peak reasoning performance (o3, Gemini 3 Pro)
  • Your volume is under 5 million tokens per month
  • You lack MLOps expertise and don't want to build it
  • Speed to market matters more than per-token cost
  • You need the absolute best instruction-following precision

Choose open models when:

  • You process more than 10 million tokens per month
  • Data must stay on your infrastructure (regulatory or compliance requirements)
  • You need to fine-tune on proprietary data
  • Vendor lock-in is a strategic risk
  • You want to freeze a specific model version long-term

Choose managed open model hosting when:

  • You want open model economics without self-hosting complexity
  • Latency matters (Groq and Cerebras offer industry-leading inference speed)
  • You want to switch models freely without rewriting integration code

Pro Tip: A common production pattern is routing: use a fast, cheap open model (7-14B, self-hosted or on Groq) for 80% of requests, and escalate to a frontier closed model (o3 or Gemini 3 Pro) for the 20% that require maximum capability. This can reduce costs by 70-80% versus using a frontier model for everything.
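A minimal sketch of that routing pattern follows, assuming both endpoints speak the OpenAI-compatible chat API (Groq's does). The escalation heuristic, model names, and threshold are placeholders you would replace with a small trained classifier or task-specific rules.

```python
# Hypothetical router: cheap open model for most requests, frontier model for hard ones.
import os
from openai import OpenAI

cheap = OpenAI(base_url="https://api.groq.com/openai/v1",
               api_key=os.environ["GROQ_API_KEY"])      # OpenAI-compatible endpoint
frontier = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def needs_frontier(prompt: str) -> bool:
    # Placeholder heuristic; production routers often use a small trained classifier.
    hard_markers = ("prove", "refactor", "derive", "multi-step")
    return len(prompt) > 4000 or any(m in prompt.lower() for m in hard_markers)

def complete(prompt: str) -> str:
    client, model = ((frontier, "o3") if needs_frontier(prompt)
                     else (cheap, "llama-3.3-70b-versatile"))
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(complete("Summarize this ticket in one sentence: printer won't connect to Wi-Fi."))
```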

The MoE revolution driving convergence

A key architectural trend explains why open models caught up so quickly: Mixture-of-Experts (MoE). Every major open model released in 2025 uses MoE:

| Model | Total Parameters | Active Parameters | Efficiency Ratio |
|---|---|---|---|
| DeepSeek V3/R1 | 671B | 37B | 5.5% |
| Llama 4 Maverick | 400B | 17B | 4.3% |
| Qwen3-235B | 235B | 22B | 9.4% |
| Mistral Large 3 | 675B | 41B | 6.1% |
| Kimi K2 | 1,000B | 32B | 3.2% |

MoE models activate only a small fraction of their total parameters per token, achieving knowledge capacity of a massive model with the inference cost of a much smaller one. DeepSeek V3 has 671B total parameters but only 37B fire per token — meaning inference cost is comparable to a 37B dense model, while knowledge breadth approaches that of a 671B dense model.

This architecture, combined with innovations like Multi-Head Latent Attention (93.3% KV cache reduction in DeepSeek) and FP8 training, is why a $5.9M training run can now produce frontier-level models. The brute-force scaling hypothesis — that more compute always wins — has been challenged by architectural efficiency.
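A toy top-k gating layer makes the sparse-activation idea concrete. This is a simplified sketch (no load-balancing loss, no shared experts, unoptimized routing loop) rather than DeepSeek's or Qwen's actual implementation: each token is routed to only k of n_experts feed-forward blocks, so compute per token scales with k, not with the total expert count.

```python
# Toy top-k Mixture-of-Experts layer (PyTorch): only k of n_experts run per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1) # pick the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # run only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

y = TopKMoE()(torch.randn(4, 512))                 # 4 tokens, each touching 2 of 8 experts
```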

Conclusion

The open vs closed LLM landscape in 2026 is defined by one fact: the capability gap has largely closed, but the deployment trade-offs haven't. DeepSeek R1 proved that frontier-level reasoning doesn't require hundreds of millions in training budget. Qwen 3 and Mistral Large 3 proved that Apache 2.0 licensing and world-class performance aren't mutually exclusive. And the pricing data proves that self-hosting open models is economically compelling above 5-10 million tokens per month.

But closed models aren't standing still. Gemini 3 Pro broke the 1500 Elo barrier on LMArena. Claude Sonnet 4.5 leads SWE-bench. And for teams that value simplicity over optimization, the managed API experience remains hard to beat.

The winning strategy in 2026 isn't picking a side — it's understanding the trade-offs well enough to pick the right model for each use case. Route intelligently, benchmark on your specific workload, and don't let ideology substitute for measurement.

For deeper context on how LLMs process and retrieve information under the hood, see our guide on How Retrieval-Augmented Generation (RAG) Actually Works. If you're interested in Meta's architectural innovations beyond standard transformers, check out The Ultimate Guide to Meta's Large Concept Models.