The gap between a robotic, repetitive chatbot and a creative, nuanced AI assistant comes down to a single decision point: how the model picks its next token. Every time a Large Language Model generates text, it faces a vocabulary of over 100,000 candidates and must choose just one. The algorithms that make this choice—temperature scaling, Top-K, Top-P, and the increasingly dominant Min-P—determine whether the output is boringly predictable, brilliantly creative, or incoherently random.
These parameters appear in every LLM API, yet most practitioners treat them as magic numbers. Understanding the math and intuition behind each one transforms you from a user who copies defaults into an engineer who can precisely control output quality for any use case.
From logits to a probability distribution
Before any sampling happens, a language model produces raw scores called logits—one for every token in its vocabulary. As explained in How Large Language Models Actually Work, these logits emerge from the final linear layer of the transformer. They are not probabilities: logits can be negative, exceed 1, and do not sum to anything meaningful.
The softmax function converts logits into a proper probability distribution:

$$P(\text{token}_i) = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}$$

where $z_i$ is the logit for token $i$ and $|V|$ is the vocabulary size.
In Plain English: Softmax exponentiates each logit (making everything positive), then divides by the total. A high logit becomes a high probability; a low logit becomes a near-zero probability. The result always sums to exactly 1 (100%).
Once we have probabilities, we need a strategy for selecting which token to generate. The simplest approach—greedy decoding—always picks the token with the highest probability. While this sounds optimal, it produces degenerate output. In experiments by Holtzman et al. (ICLR 2020), greedy decoding generated text where the same phrases repeated in loops, even when the underlying model was highly capable. The solution is sampling: randomly drawing from the distribution, weighted by probability. The question becomes: which parts of the distribution should we sample from?
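A minimal sketch of that pipeline in NumPy makes the distinction concrete (the five-token vocabulary and logit values are made up for illustration):

```python
import numpy as np

# Toy logits for a five-token vocabulary (illustrative values only)
logits = np.array([3.0, 2.0, 0.5, -1.0, -2.0])

# Softmax: exponentiate, then normalize so the probabilities sum to 1
probs = np.exp(logits - logits.max())   # subtracting the max improves numerical stability
probs /= probs.sum()

# Greedy decoding: always take the single most likely token
greedy_token = int(np.argmax(probs))

# Sampling: draw randomly, weighted by probability
rng = np.random.default_rng(0)
sampled_token = int(rng.choice(len(probs), p=probs))

print(probs.round(3), greedy_token, sampled_token)
```

Run repeatedly, the greedy choice never changes, while the sampled choice occasionally lands on lower-probability tokens.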
Temperature as a confidence dial
Temperature is the most fundamental sampling parameter. It reshapes the probability distribution before any token selection happens by dividing every logit by the temperature value $T$ before softmax:

$$P(\text{token}_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{|V|} e^{z_j / T}}$$
In Plain English: Temperature works like a contrast dial on a photograph. Low temperature (T < 1) cranks up contrast—the already-bright pixels get brighter, the dark ones go black. The model becomes deterministic and focused. High temperature (T > 1) reduces contrast—everything washes toward gray. The model becomes creative but risks incoherence.
Low versus high temperature in practice
Consider a model predicting the next word after "The mouse ate the..." with these logits:
| Token | Logit | P at T=1.0 | P at T=0.2 | P at T=2.0 |
|---|---|---|---|---|
| cheese | 3.0 | 68.6% | 99.3% | 50.6% |
| crumb | 2.0 | 25.3% | 0.7% | 30.7% |
| cable | 0.5 | 5.6% | ~0% | 14.5% |
| moon | -2.0 | 0.5% | ~0% | 4.2% |
At , "cheese" captures 99.3% of the probability mass—the output is effectively deterministic. At , even "moon" gets a 4.2% chance, opening the door to surprising (or nonsensical) completions.
Key Insight: Temperature does not remove options from the candidate pool. It only reweights them. Even at T = 2.0, every token in the vocabulary retains some nonzero probability. This is why truncation methods like Top-K and Top-P exist—to actually eliminate bad candidates.
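The figures in the table are easy to reproduce; a short sketch using the same four logits:

```python
import numpy as np

def temperature_softmax(logits, T):
    """Divide logits by T, then apply softmax."""
    z = logits / T
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

tokens = ["cheese", "crumb", "cable", "moon"]
logits = np.array([3.0, 2.0, 0.5, -2.0])

for T in (1.0, 0.2, 2.0):
    probs = temperature_softmax(logits, T)
    print(f"T={T}: " + ", ".join(f"{t}={p:.1%}" for t, p in zip(tokens, probs)))
# T=1.0: cheese=68.6%, crumb=25.3%, cable=5.6%, moon=0.5%
# T=0.2: cheese=99.3%, crumb=0.7%, cable=0.0%, moon=0.0%
# T=2.0: cheese=50.6%, crumb=30.7%, cable=14.5%, moon=4.2%
```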
Top-K sampling: hard cutoff
Top-K sampling, introduced by Fan et al. (ACL 2018) in their work on hierarchical story generation, takes a brute-force approach to taming the long tail:
- Sort all tokens by probability in descending order.
- Keep only the top K tokens.
- Set the probability of everything else to zero.
- Renormalize so the remaining probabilities sum to 1.
In Plain English: Top-K says "I will only consider the K best options, no matter what." If K = 50, the 51st-most-likely token is discarded regardless of how close its probability was to the 50th.
The static K problem
Top-K's fundamental flaw is that the "right" number of candidates changes with every token position:
High-certainty context: "The capital of France is..." — Only 1-2 reasonable completions exist ("Paris", maybe "located"). With K = 50, you force the model to consider 48 irrelevant options like "banana" or "running."
High-uncertainty context: "My favorite food is..." — Hundreds of valid completions exist. With K = 50, you cut off perfectly valid answers like the 51st most popular cuisine.
A fixed K is simultaneously too large for confident predictions and too small for uncertain ones. This rigidity motivated the development of Top-P.
Top-P nucleus sampling: the dynamic window
Holtzman et al. (ICLR 2020) proposed nucleus sampling (Top-P), which adapts the candidate pool size based on the model's confidence:
- Sort tokens by probability descending.
- Accumulate probabilities one token at a time.
- Stop when the cumulative sum reaches the threshold P (e.g., 0.95).
- Discard everything below the cutoff and renormalize.
In Plain English: Top-P says "Keep adding tokens to the candidate pool until I am 95% confident (for P = 0.95) that the right answer is in here." For "The capital of France is...", "Paris" alone might cover 95%—so the pool is just 1 token. For "My favorite food is...", the top option might only be 1%, so it takes 200 tokens to reach 95%—and all 200 stay in the pool.
Top-P became the industry standard. OpenAI, Anthropic, and Google all expose it as a primary API parameter, and it remains the default truncation method for commercial APIs in 2026.
The flat-distribution flaw
Top-P has a subtle weakness. When the model is genuinely confused (a flat probability distribution), reaching 95% cumulative probability requires including hundreds or thousands of tokens—many of which are low-quality noise. The more confused the model is, the more garbage Top-P lets through. This insight motivated Min-P.
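A quick numerical sketch of the flaw, using synthetic peaked and near-flat distributions over a toy 1,000-token vocabulary (the numbers are illustrative, not from any real model):

```python
import numpy as np

def nucleus_size(probs, p=0.95):
    """How many tokens Top-P keeps: the smallest set whose cumulative probability reaches p."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)

rng = np.random.default_rng(0)
vocab_size = 1000

# Confident model: one token dominates the distribution
peaked_logits = rng.normal(0.0, 1.0, vocab_size)
peaked_logits[0] = 15.0
peaked = np.exp(peaked_logits) / np.exp(peaked_logits).sum()

# Confused model: probability spread thinly across the whole vocabulary
flat_logits = rng.normal(0.0, 0.1, vocab_size)
flat = np.exp(flat_logits) / np.exp(flat_logits).sum()

print(nucleus_size(peaked))  # 1: the nucleus is a single token
print(nucleus_size(flat))    # several hundred tokens, most of them noise
```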
Min-P: confidence-scaled filtering
Min-P, formalized by Nguyen et al. (ICLR 2025, Oral), introduces a threshold that scales dynamically with the model's own confidence:

$$\text{threshold} = \text{min\_p} \times P(\text{most likely token})$$

Any token with probability below this threshold is discarded:

- Find the probability of the most likely token.
- Multiply it by the min_p parameter (e.g., 0.1).
- Discard every token whose probability falls below that threshold.
- Renormalize the survivors.
Example with min_p = 0.1:
| Scenario | Top Token | P(top) | Threshold | Effect |
|---|---|---|---|---|
| High confidence | "Paris" | 0.90 | 0.09 (9%) | Very strict — only strong candidates survive |
| Low confidence | "maybe" | 0.05 | 0.005 (0.5%) | Permissive — many options remain |
In Plain English: Min-P says "my standard for inclusion scales with how sure I am." When the model is confident, it demands high quality from every candidate. When the model is guessing, it relaxes and allows more variety. This single mechanism replaces the need to separately tune Top-K and Top-P.
The Min-P paper—ranked 18th highest-scoring submission at ICLR 2025 and accepted as an oral presentation—demonstrated that Min-P with values between 0.05 and 0.1 consistently outperforms Top-P, particularly at higher temperatures where Top-P's flat-distribution flaw is most damaging.
Min-P is now the default sampling method across the open-source ecosystem. It is natively supported in llama.cpp, vLLM, HuggingFace Transformers, Ollama, ExLlamaV2, KoboldCpp, and text-generation-webui. Commercial APIs (OpenAI, Anthropic, Google) do not yet expose Min-P as a parameter, making Top-P the best available truncation option for those platforms.
Pro Tip: For local or open-source deployments, use Min-P between 0.05 and 0.1 instead of Top-P. A value of 0.05 works well for creative tasks; 0.1 is better for general-purpose generation. Combined with moderate temperature (0.7-1.0), Min-P produces noticeably more coherent output than Top-P at equivalent diversity levels.
Top-n-sigma: temperature-invariant truncation
The newest entrant in the sampling landscape is Top-n-sigma, published by Tang et al. (ACL 2025). It addresses a problem that affects every probability-based truncation method: temperature coupling.
With Top-P and Min-P, changing the temperature inadvertently changes which tokens survive truncation. A higher temperature flattens probabilities, causing Top-P to include more tokens—even if the goal was only to increase randomness among the same candidates. Top-n-sigma decouples these two concerns by operating on raw logits instead of probabilities:
$$\text{threshold} = M - n \cdot \sigma$$

where $M$ is the largest logit and $\sigma$ is the standard deviation of all logits. Any token with a logit below this threshold is masked out before softmax.
In Plain English: Top-n-sigma says "keep any token within n standard deviations of the best token." Because both the maximum and the standard deviation scale identically when divided by temperature, the set of surviving tokens is mathematically identical regardless of temperature. Temperature then only affects the relative probabilities among the survivors.
The implementation is two lines of code:

```python
threshold = logits.max() - n * logits.std()
logits[logits < threshold] = float('-inf')
```
The paper recommends a default of n = 1.0. Top-n-sigma is integrated into the llama.cpp sampler chain as of early 2026, though it is disabled by default (set to -1) and must be explicitly enabled by the user.
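The invariance claim is easy to verify numerically; a small sketch with random toy logits:

```python
import numpy as np

def top_n_sigma_mask(logits, n=1.0):
    """Boolean mask of tokens that survive the top-n-sigma cutoff."""
    return logits >= logits.max() - n * logits.std()

rng = np.random.default_rng(42)
logits = rng.normal(0.0, 2.0, 100)   # toy logits for a 100-token vocabulary

for T in (0.2, 1.0, 2.0):
    survivors = top_n_sigma_mask(logits / T, n=1.0)
    print(T, survivors.sum())   # the survivor count is identical at every temperature
```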
Other notable decoding strategies
Several additional methods are worth understanding for specialized use cases:
Mirostat (Basu et al., ICLR 2021) takes a control-theory approach. Instead of fixing parameters, Mirostat monitors the perplexity (surprise level) of generated text in real time and adjusts truncation dynamically. If output becomes too repetitive (low perplexity), Mirostat loosens constraints. If it becomes too chaotic (high perplexity), Mirostat tightens them. It is available in llama.cpp as Mirostat v1 and v2, and is particularly useful for long-form generation where consistent quality matters more than per-token control.
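A simplified sketch of the Mirostat v2 loop; the target surprise tau, learning rate eta, and toy distribution are illustrative, and real implementations differ in the details:

```python
import numpy as np

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=None):
    """One simplified Mirostat v2 step: truncate by surprise, sample, then update mu."""
    if rng is None:
        rng = np.random.default_rng()
    surprise = -np.log2(probs)                # information content of each token
    keep = surprise <= mu                     # drop tokens more surprising than mu
    if not keep.any():                        # safety: always keep the least surprising token
        keep[np.argmin(surprise)] = True
    truncated = np.where(keep, probs, 0.0)
    truncated /= truncated.sum()
    token = int(rng.choice(len(probs), p=truncated))
    observed = -np.log2(truncated[token])     # how surprising the chosen token actually was
    mu -= eta * (observed - tau)              # steer future steps toward the target surprise
    return token, mu

# Carry mu across the generation loop, starting at 2 * tau
tau, mu = 5.0, 10.0
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
token, mu = mirostat_v2_step(probs, mu, tau=tau)
print(token, mu)
```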
Typical sampling (Meister et al., TACL 2022) keeps tokens whose information content (negative log probability) is close to the expected information content (entropy). Rather than favoring high-probability tokens, typical sampling favors tokens that are typically surprising—neither too predictable nor too unlikely. This produces text that more closely matches the statistical properties of human writing.
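A sketch of the filtering step; mass is the cumulative-probability budget (analogous to Top-P's threshold) and the toy distribution is illustrative:

```python
import numpy as np

def typical_filter(probs, mass=0.95):
    """Keep the tokens whose information content is closest to the distribution's entropy."""
    info = -np.log(probs)                       # information content of each token
    entropy = np.sum(probs * info)              # expected information content
    deviation = np.abs(info - entropy)          # distance from "typical" surprise
    order = np.argsort(deviation)               # most typical tokens first
    cum = np.cumsum(probs[order])
    cutoff = min(np.searchsorted(cum, mass), len(probs) - 1)
    keep = order[:cutoff + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.50, 0.25, 0.12, 0.08, 0.05])
print(typical_filter(probs, mass=0.9).round(4))
```

Note that the second-ranked token here is "more typical" than the most probable one, which is exactly the behavior the method is designed for.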
Contrastive decoding (Li et al., ACL 2023) compares the output distributions of a large "expert" model and a small "amateur" model, then amplifies tokens where the expert disagrees most with the amateur. O'Brien and Lewis (2023) showed this improves reasoning by suppressing the superficial patterns that small models exploit.
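A sketch of the scoring rule with toy expert and amateur distributions; alpha and the values are illustrative, and the full method handles further details such as the amateur model's temperature:

```python
import numpy as np

def contrastive_scores(p_expert, p_amateur, alpha=0.1):
    """Expert log-prob minus amateur log-prob, restricted to tokens the expert finds plausible."""
    plausible = p_expert >= alpha * p_expert.max()           # expert's plausibility set
    return np.where(plausible,
                    np.log(p_expert) - np.log(p_amateur),    # reward expert/amateur disagreement
                    -np.inf)                                  # never pick implausible tokens

# Toy distributions: the amateur over-weights a generic token (index 1)
p_expert  = np.array([0.45, 0.30, 0.15, 0.10])
p_amateur = np.array([0.20, 0.60, 0.10, 0.10])
print(int(np.argmax(contrastive_scores(p_expert, p_amateur))))  # 0: the amateur undervalues it
```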
Speculative decoding (Leviathan et al., ICML 2023; Chen et al., 2023) is an acceleration technique rather than a quality-altering strategy. A small "draft" model generates candidate tokens cheaply, and the large model verifies them in a single batched forward pass. When the draft tokens are correct—which they frequently are for common patterns—you get multiple tokens for the cost of one large-model inference step, achieving 2-3x speedups without changing output quality.
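A simplified sketch of the accept/reject rule at the heart of the method; the helper name and toy distributions are mine, not from the papers:

```python
import numpy as np

def verify_drafts(draft_tokens, p_draft, p_target, rng):
    """Accept or reject a block of draft tokens against the target model's distributions.

    p_draft[i] and p_target[i] are the two models' next-token distributions at draft position i.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept with probability min(1, p_target / p_draft): free tokens when the big model agrees
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
        else:
            # First rejection: resample from the residual distribution and stop
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break
    return accepted

rng = np.random.default_rng(0)
p_draft  = [np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.4, 0.1])]
p_target = [np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.7, 0.1])]
print(verify_drafts([0, 0], p_draft, p_target, rng))
```

This accept/reject scheme is what guarantees the final text is distributed exactly as if the large model had generated every token itself.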
The sampler processing pipeline
When multiple sampling strategies are combined, the order in which they run changes the output. The default chain in llama.cpp (as of early 2026) is:
logits → penalties → dry → top_n_sigma → top_k → typical → top_p → min_p → xtc → temperature → sample
The chain includes DRY (Don't Repeat Yourself, a context-aware repetition suppressor) and XTC (Exclude Top Choices, which occasionally removes the top token to force variety). Both are disabled by default but occupy fixed positions in the pipeline.
Three critical observations about this order:
Penalties run first. Repetition, frequency, and presence penalties are applied before any truncation. This ensures penalized tokens are deprioritized before the probability mass is redistributed by Top-K or Top-P. If penalties were applied after truncation, they might penalize tokens that were already eliminated, having no effect.
Temperature runs last. In llama.cpp, temperature is applied just before the final random sample. This means you can use high temperatures without fear—noise tokens have already been removed by the truncation samplers upstream. Setting T = 0 with temperature-last produces deterministic (greedy) output regardless of Top-P or Min-P settings.
HuggingFace Transformers applies temperature first. This means the same parameter values produce different outputs depending on the backend. If you tune sampling parameters in one framework and deploy with another, the behavior will change. Always verify the sampler order of your specific framework.
Common Pitfall: Many practitioners tune sampling parameters on one framework (say, the OpenAI API) and deploy on another (say, vLLM or llama.cpp). Because the sampler order differs, the same parameter values produce different text. Always test with the exact backend you will use in production.
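A small sketch of why the order matters, applying the same Min-P filter with temperature applied after truncation (llama.cpp-style) versus before it (Transformers-style); the logits are toy values:

```python
import numpy as np
from scipy.special import softmax

def min_p_survivors(probs, min_p=0.1):
    """Indices of tokens that survive a Min-P filter."""
    return np.flatnonzero(probs >= min_p * probs.max())

logits = np.array([3.0, 2.0, 0.5, -1.0, -2.0])
T = 2.0

# Temperature last (llama.cpp order): truncation sees the un-heated distribution
print(min_p_survivors(softmax(logits), min_p=0.1))       # [0 1]

# Temperature first (HF Transformers order): the flattened distribution lets more tokens through
print(min_p_survivors(softmax(logits / T), min_p=0.1))   # [0 1 2 3]
```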
Provider defaults and the reasoning model exception
Every major LLM API ships different default sampling parameters. Understanding them prevents unexpected behavior:
| Provider | Temperature | Top-P | Top-K | Penalties |
|---|---|---|---|---|
| OpenAI (GPT-4o) | 1.0 | 1.0 | Not exposed | freq=0, presence=0 |
| Anthropic (Claude) | 1.0 | Not set | Not set | Not exposed |
| Google (Gemini) | 1.0 | 0.95 | 40 | freq=0, presence=0 |
| Meta (Llama via vLLM) | 1.0 | 1.0 | Disabled | rep_penalty=1.0 |
| DeepSeek | 1.0 | Not set | Not exposed | freq=0, presence=0 |
The DeepSeek temperature mapping. DeepSeek V3 implements a hidden linear mapping: $T_{\text{model}} = 0.3 \times T_{\text{API}}$. An API temperature of 1.0 actually runs the model at an internal temperature of 0.3. DeepSeek chose this because most users leave temperature at the default, and their testing found 0.3 to be optimal. If you set the API temperature to 2.0, the model runs at an internal 0.6—still conservative by normal standards.
Reasoning models lock their parameters. OpenAI's o1, o3, and o4-mini series fix temperature at 1.0 and Top-P at 1.0—these parameters must be omitted entirely from API requests, or the API rejects the call with an "unsupported parameter" error. The reason: reasoning models run multiple internal chains of thought, evaluate them, and select the best. If temperature were set to 0, all chains would collapse to the identical greedy path, defeating the purpose of multi-path reasoning. Anthropic's Claude with extended thinking blocks temperature and top_k modifications entirely, allowing only Top-P adjustments within the narrow range of 0.95 to 1.0. For more on how these models work internally, see Reasoning Models: How AI Learned to Think Step by Step.
Repetition, frequency, and presence penalties
These three parameters control how aggressively the model avoids repeating itself. They are often confused, but work quite differently.
Repetition penalty (introduced in the CTRL paper, Keskar et al. 2019) is multiplicative. The original CTRL paper simply divided the logits of previously seen tokens by a penalty factor $\theta$. However, this has a well-known bug: dividing a negative logit by $\theta$ makes it less negative, actually increasing its probability. The corrected version, now standard in HuggingFace Transformers and most frameworks, handles the sign:

$$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{if } z_i \le 0 \end{cases}$$

where $\theta$ is the penalty factor (typical default: 1.0, meaning no penalty; common setting: 1.1-1.2).
Frequency and presence penalties (OpenAI-style) are additive:

$$z_i' = z_i - \alpha_{\text{freq}} \cdot c(i) - \alpha_{\text{pres}} \cdot \mathbb{1}[c(i) > 0]$$

where $c(i)$ is the count of prior occurrences of token $i$. Frequency penalty scales linearly with count—saying "the" 10 times incurs 10x the penalty of saying it once. Presence penalty is a flat one-time deduction applied to any token that has appeared at all, regardless of frequency. Use frequency penalty to reduce word-level repetition; use presence penalty to encourage topic diversity.
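Both penalty styles take only a few lines to implement; a sketch applying the formulas above to toy logits and a toy generation history (function names and values are illustrative):

```python
import numpy as np

def apply_repetition_penalty(logits, prev_token_ids, theta=1.2):
    """CTRL-style multiplicative penalty with the sign-aware correction."""
    out = logits.copy()
    for t in set(prev_token_ids):
        out[t] = out[t] / theta if out[t] > 0 else out[t] * theta
    return out

def apply_openai_penalties(logits, prev_token_ids, freq_penalty=0.0, presence_penalty=0.0):
    """OpenAI-style additive penalties: frequency scales with count, presence is a flat deduction."""
    out = logits.copy()
    counts = np.bincount(prev_token_ids, minlength=len(logits))
    out -= freq_penalty * counts            # grows with how often each token has appeared
    out -= presence_penalty * (counts > 0)  # one-time deduction for any appearance at all
    return out

logits = np.array([2.0, 1.0, 0.5, -0.5])
history = [0, 0, 0, 2]   # token 0 used three times, token 2 once
print(apply_repetition_penalty(logits, history, theta=1.2))
print(apply_openai_penalties(logits, history, freq_penalty=0.3, presence_penalty=0.5))
```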
Python implementation from scratch
The following code implements the five core sampling methods and demonstrates their behavior on a simulated vocabulary. All output values have been verified by execution.
```python
import numpy as np
from scipy.special import softmax

def apply_temperature(logits, temperature=1.0):
    if temperature == 0:
        # Greedy decoding: all probability on the top token
        probs = np.zeros_like(logits, dtype=float)
        probs[np.argmax(logits)] = 1.0
        return probs
    return softmax(logits / max(temperature, 1e-7))

def sample_top_k(probs, k=5):
    top_k_idx = np.argsort(probs)[-k:]
    mask = np.zeros_like(probs, dtype=bool)
    mask[top_k_idx] = True
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()

def sample_top_p(probs, p=0.9):
    sorted_idx = np.argsort(probs)[::-1]
    cumsum = np.cumsum(probs[sorted_idx])
    cutoff = min(np.searchsorted(cumsum, p), len(probs) - 1)
    keep = sorted_idx[:cutoff + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def sample_min_p(probs, min_p=0.1):
    threshold = np.max(probs) * min_p
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

def sample_top_n_sigma(logits, n=1.0):
    threshold = logits.max() - n * logits.std()
    masked = np.where(logits >= threshold, logits, -np.inf)
    return softmax(masked)

# Simulated vocabulary and logits (raw model output)
vocab = ["apple", "banana", "cherry", "date", "elderberry",
         "fig", "grape", "honeydew", "ice", "jackfruit"]
logits = np.array([2.0, 1.5, 0.5, 0.1, -1.0,
                   3.5, 0.2, -0.5, 0.0, 0.1])

# Base distribution at T=1.0
base_probs = softmax(logits)
print("Base probabilities (T=1.0):")
for w, p in zip(vocab, base_probs):
    if p > 0.01:  # Only show tokens above 1%
        print(f"  {w}: {p:.4f}")

# Temperature comparison
for T in [0.5, 2.0]:
    probs = apply_temperature(logits, T)
    top = vocab[np.argmax(probs)]
    print(f"\nTemperature T={T}: top={top} ({probs.max():.4f})")

# Top-K with K=3
top_k = sample_top_k(base_probs, k=3)
survivors = [(w, f"{p:.4f}") for w, p in zip(vocab, top_k) if p > 0]
print(f"\nTop-K (K=3): {survivors}")

# Min-P with min_p=0.1
min_p = sample_min_p(base_probs, min_p=0.1)
threshold = np.max(base_probs) * 0.1
survivors = [(w, f"{p:.4f}") for w, p in zip(vocab, min_p) if p > 0]
print(f"Min-P (min_p=0.1, threshold={threshold:.4f}): {survivors}")

# Top-n-sigma with n=1.0
ns = sample_top_n_sigma(logits, n=1.0)
survivors = [(w, f"{p:.4f}") for w, p in zip(vocab, ns) if p > 0.001]
print(f"Top-n-sigma (n=1.0): {survivors}")
```
Expected output:
```
Base probabilities (T=1.0):
  apple: 0.1420
  banana: 0.0861
  cherry: 0.0317
  date: 0.0212
  fig: 0.6363
  grape: 0.0235
  honeydew: 0.0117
  ice: 0.0192
  jackfruit: 0.0212

Temperature T=0.5: top=fig (0.9298)

Temperature T=2.0: top=fig (0.3295)

Top-K (K=3): [('apple', '0.1643'), ('banana', '0.0996'), ('fig', '0.7361')]
Min-P (min_p=0.1, threshold=0.0636): [('apple', '0.1643'), ('banana', '0.0996'), ('fig', '0.7361')]
Top-n-sigma (n=1.0): [('fig', '1.0000')]
```
Notice how each method produces a different candidate pool from the same base distribution:
- Temperature changes relative probabilities but keeps all 10 tokens. At T = 0.5, "fig" dominates at 93%; at T = 2.0, it drops to 33%.
- Top-K (K=3) keeps exactly 3 tokens: "fig", "apple", "banana"—regardless of the probability gap between the 3rd and 4th options.
- Min-P (0.1) produces the same 3 survivors in this case because the threshold (6.36%) happens to cut at the same boundary. But with a flatter distribution, Min-P would keep more tokens than Top-K=3.
- Top-n-sigma (n=1.0) is the most aggressive here: only "fig" (logit 3.5) survives because it is the only token within 1 standard deviation of the maximum logit. Increasing n to 2.0 would include "apple" and "banana" as well.
Recommended settings by use case
| Use Case | Temperature | Truncation | Why |
|---|---|---|---|
| Code generation | 0.0 - 0.2 | Min-P 0.1 or Top-P 0.95 | Syntax requires precision; wrong tokens cause errors |
| Math and reasoning | 0.0 - 0.3 | Top-P 0.95 | Logical chains break with randomness |
| Factual Q&A | 0.3 - 0.7 | Min-P 0.1 | Balance accuracy with natural variation |
| General chat | 0.7 - 1.0 | Min-P 0.05 or Top-P 0.9 | Natural conversation needs some unpredictability |
| Creative writing | 1.0 - 1.5 | Min-P 0.05 | High diversity without incoherent noise |
| Brainstorming | 1.2 - 1.8 | Min-P 0.02 - 0.05 | Maximum variety; accept occasional oddity |
Pro Tip: For commercial APIs that do not expose Min-P, use Top-P between 0.9 and 0.95. For open-source deployments, prefer Min-P between 0.05 and 0.1 with Temperature 0.7-1.0 and skip Top-K and Top-P entirely. This two-parameter setup (Temperature + Min-P) is what most llama.cpp and vLLM power users have converged on as of early 2026.
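As a concrete illustration of that two-knob setup, here is how it might look with vLLM's offline API. This is a hypothetical sketch: it assumes a vLLM build recent enough to expose min_p in SamplingParams, and the model name is only an example:

```python
# Hypothetical sketch: assumes a vLLM version with min_p support; model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(
    temperature=0.8,   # moderate randomness for general chat
    min_p=0.05,        # confidence-scaled truncation; Top-K and Top-P left untouched
    max_tokens=256,
)

outputs = llm.generate(["Write a two-sentence story about a lighthouse."], params)
print(outputs[0].outputs[0].text)
```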
Conclusion
LLM sampling is ultimately about navigating the trade-off between coherence and creativity. Temperature reshapes the probability landscape. Top-K imposes a hard boundary. Top-P adapts to confidence. Min-P scales with the model's own certainty. And Top-n-sigma cleanly separates "which tokens to consider" from "how randomly to choose among them."
The field has evolved significantly since the early days of Top-K. The research trajectory—from static truncation (Fan et al., 2018) to dynamic nucleus sampling (Holtzman et al., 2020) to confidence-scaled filtering (Nguyen et al., 2025) to temperature-invariant methods (Tang et al., 2025)—reflects a steady march toward samplers that require less manual tuning and produce better text across a wider range of conditions.
For practitioners in 2026, the practical advice is straightforward: use Temperature + Min-P for open-source deployments, Temperature + Top-P for commercial APIs, and leave reasoning models at their locked defaults. The days of needing to understand five interacting parameters are ending—the modern approach is two well-chosen knobs.
To understand how the vocabulary these samplers operate on is constructed, read Tokenization Deep Dive: Why It Matters More Than You Think. For the vector representations that power semantic search over LLM outputs, see Text Embeddings: The Foundation of Semantic Search. And to explore how the logits themselves are generated by the transformer architecture, start with How Large Language Models Actually Work.