Understanding LLM Generation Parameters: Temperature, Top‑k, Top‑p, Penalties, and Max Tokens
The article explains how logits are transformed into probabilities via softmax and how generation parameters such as temperature, top‑k, top‑p, frequency‑penalty, presence‑penalty, and max_tokens intervene in the logits‑to‑sampling pipeline, detailing their mechanisms, common misconceptions, and practical limitations.
Logits and Softmax
The model's final linear layer outputs a raw score for each token in the vocabulary; these scores are called logits. Logits are not probabilities: they can be any real number and do not sum to 1. The softmax function converts this set of logits into a valid probability distribution, from which the next token is sampled. All generation parameters intervene in the "logits → probability distribution → sampling" pipeline.
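The pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular framework's implementation; the four-token vocabulary and its logit values are made up for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (illustrative values only)
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# Sample the next token id in proportion to its probability
rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(probs, sum(probs), next_token)
```

Every parameter discussed below either edits `logits` before `softmax`, edits `probs` before sampling, or stops the loop that calls this step repeatedly.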
temperature
Role
Controls the randomness of the output. Lower values make the output more deterministic; higher values increase diversity.
Principle
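Each logit is divided by the temperature value T before being fed to softmax, so p_i = exp(z_i / T) / Σ_j exp(z_j / T). Values of T below 1 widen the gaps between logits and sharpen the distribution; values above 1 shrink the gaps and flatten it.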
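A minimal sketch of temperature scaling follows, assuming made-up logits. The function name is hypothetical, and temperature=0 is rejected here because it has to be handled separately as argmax rather than as a division.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as argmax separately")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # illustrative values only
cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # plain softmax
hot = softmax_with_temperature(logits, 2.0)      # flatter distribution

for name, dist in (("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in dist])
```

The top token's share shrinks as the temperature rises, but the ordering of the tokens is the same in all three distributions.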
Limitations
Temperature only rescales the existing distribution and cannot change the relative ranking of tokens. If the model assigns a high native probability to an incorrect answer, a low temperature will amplify that error rather than correct it. For reasoning-oriented models such as o4-mini, temperature mainly affects surface wording, not factual accuracy.
Common Misconceptions
Misconception: temperature=0 yields identical outputs every time. Mathematically it reduces to argmax (greedy decoding), but in practice four factors break determinism:
Floating-point non-associativity: Floating-point addition is not associative, so when the reduction order of parallel GPU operations varies across runs, results differ by a few ULPs, which is enough to flip the argmax between near-tied tokens.
MoE routing nondeterminism: The same prompt may be batched differently, leading to different expert selections.
Cloud service load balancing: Requests may land on different hardware or driver versions and be batched together with different traffic.
Framework nondeterministic operators: Certain PyTorch/CUDA kernels sacrifice determinism for speed.
For production scenarios that require reproducibility (e.g., audits or automated testing), fix the seed and run in a single-GPU, single-process setup.
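The first factor in the list above does not need a GPU to reproduce. The snippet below is a small CPU-only illustration in plain Python: the same three numbers summed in two different orders give two different floating-point results.

```python
terms = [0.1, 0.2, 0.3]

# Same terms, different grouping, as happens when a parallel
# reduction visits its inputs in a different order.
left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

print(left_to_right == right_to_left)  # False
print(abs(left_to_right - right_to_left))
```

The discrepancy is on the order of one ULP, which only matters when two candidate tokens have nearly identical logits, but that is exactly the case in which greedy decoding can pick a different token from one run to the next.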
top_k
Role
Truncates the candidate set before sampling, keeping only the top‑k tokens with highest probability and discarding the rest, preventing the model from sampling extremely low‑probability “garbage” tokens.
Principle
After selecting the top‑k tokens, their probabilities are renormalized to sum to 1 and then sampling proceeds. When k=1, the method collapses to greedy decoding (equivalent to temperature=0).
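The truncate-then-renormalize step can be sketched as follows. This is an illustrative version that works on a probability list; real inference engines often apply the same cut to logits instead, and the function name here is hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass  # renormalize so the survivors sum to 1
    return filtered

probs = [0.5, 0.3, 0.15, 0.05]  # illustrative distribution
top2 = top_k_filter(probs, 2)   # approximately [0.625, 0.375, 0.0, 0.0]
top1 = top_k_filter(probs, 1)   # all mass on the most likely token

print(top2)
print(top1)
```

With k=1 the only surviving token receives probability 1, which is the greedy-decoding case mentioned above.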
Limitations
The size of the candidate set is fixed regardless of the shape of the distribution:
When the distribution is highly concentrated (model very certain), the top‑1 or top‑2 tokens dominate, and a large k (e.g., 50) still admits many low‑quality tail tokens.
When the distribution is flat (model uncertain), a small k (e.g., 3) may cut off many reasonable candidates, harming diversity.
This “one‑size‑fits‑all” rigidity is a weakness of top_k and motivated the introduction of top_p.
Common Misconceptions
Misconception: larger k always improves quality. Increasing k does not guarantee better results; in a sharp distribution, a larger k can introduce low‑probability, incoherent tokens.
Misconception: top_k and temperature are interchangeable dimensions of randomness. They act at different stages: temperature reshapes the distribution before softmax, affecting all tokens, while top_k truncates the candidate pool after softmax, physically removing low‑probability tokens.
top_p
Role
Dynamically truncates the candidate set based on a probability threshold p, adapting the size of the nucleus to the model’s confidence and balancing diversity with quality.
Principle
Tokens are sorted by probability in descending order and their probabilities are accumulated until the running sum first reaches or exceeds p; this minimal set is the nucleus, whose probabilities are renormalized before sampling. When the distribution is flat, the nucleus expands; when it is sharp, the nucleus contracts.
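A minimal sketch of nucleus selection is shown below, using two made-up distributions to illustrate how the candidate set adapts. The function name and the "reaches or exceeds p" stopping rule are assumptions of this sketch; implementations differ in how they treat the boundary token.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [0.0] * len(probs)
    for i in nucleus:
        filtered[i] = probs[i] / cumulative  # renormalize inside the nucleus
    return filtered

sharp = [0.92, 0.05, 0.02, 0.01]  # model is confident
flat = [0.30, 0.25, 0.25, 0.20]   # model is uncertain

sharp_nucleus = top_p_filter(sharp, 0.9)
flat_nucleus = top_p_filter(flat, 0.9)

sharp_size = sum(1 for x in sharp_nucleus if x > 0)
flat_size = sum(1 for x in flat_nucleus if x > 0)
print(sharp_size, flat_size)
```

With the same p=0.9, the sharp distribution keeps a single token while the flat one keeps all four, which is the adaptive behaviour that a fixed k cannot provide.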
Limitations
top_p is sensitive to the exact p value; small changes (e.g., 0.9 vs 0.95) can produce markedly different candidate sizes for flat distributions. Moreover, its interaction with temperature is nonlinear: temperature first reshapes the distribution, then top_p truncates the reshaped distribution, making their combined effect hard to predict.
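To make the interaction concrete, the sketch below counts how many tokens land in the nucleus for a few temperature and p combinations. The logits are invented for illustration and the helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Number of top tokens needed for the cumulative mass to reach p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)

logits = [3.0, 2.0, 1.0, 0.0, -1.0]  # illustrative values only
sizes = {}
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)  # temperature reshapes first
    for p in (0.9, 0.95):
        sizes[(temperature, p)] = nucleus_size(probs, p)  # then top_p truncates
        print(f"T={temperature} p={p} nucleus size={sizes[(temperature, p)]}")
```

With these logits the nucleus at p=0.9 grows from 2 to 3 to 4 tokens as the temperature rises from 0.5 to 1.0 to 2.0, and at temperature 2.0 moving p from 0.9 to 0.95 admits a fifth token, while the same change in p makes no difference at the two lower temperatures.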
Common Misconceptions
Misconception: temperature and top_p should always be tuned together for optimal results. Joint tuning makes it hard to attribute a change in output to either parameter; many API guides recommend adjusting only one at a time (e.g., hold top_p fixed, commonly at 0.9–0.95 or the API's default, while varying temperature, then optionally fine-tune top_p).
Misconception: top_p=1.0 is equivalent to disabling top_p. Mathematically true, but if top_k is also set, its truncation still applies, so sampling is not truly unrestricted.
frequency_penalty
Role
Penalizes tokens that have already appeared, with the penalty increasing proportionally to the token’s count, helping to break repetitive loops.
Principle
For each token, an additive penalty proportional to its occurrence count is subtracted from its logit (OpenAI implementation):
adjusted_logit = logit - frequency_penalty * count(token)
Example: with frequency_penalty=0.2 and a token logit of 100, the adjusted logit is 99.8 after the token has appeared once, 99.6 after two appearances, 99.4 after three, and so on.
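The additive rule can be sketched as below. This is an illustration of the formula as stated in this article, not OpenAI's actual code; token ids are assumed to be list indices and the values are made up.

```python
from collections import Counter

def apply_frequency_penalty(logits, generated_token_ids, frequency_penalty):
    """Subtract frequency_penalty * count from each token's logit."""
    counts = Counter(generated_token_ids)  # missing ids count as 0
    return [
        logit - frequency_penalty * counts[token_id]
        for token_id, logit in enumerate(logits)
    ]

logits = [100.0, 50.0, 10.0]  # one logit per token id 0, 1, 2
history = [0, 0, 2]           # token 0 generated twice, token 2 once

adjusted = apply_frequency_penalty(logits, history, frequency_penalty=0.2)
print(adjusted)  # token 0 drops by 0.4, token 1 is unchanged, token 2 drops by 0.2
```

Because the subtraction grows with the count, a token caught in a loop is pushed down a little further every time it is emitted.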
Limitations
The penalty applies to all previously generated tokens, including functional words, punctuation, and necessary terminology. Excessive values can force the model to avoid reasonable repetitions, resulting in unnatural phrasing or factual inaccuracies.
Common Misconceptions
Misconception: frequency_penalty and HuggingFace's repetition_penalty are the same. OpenAI's version is additive (a count-scaled amount is subtracted from the logit), while HuggingFace's is multiplicative (a positive logit is divided by the factor and a negative logit is multiplied by it, so both move down); they are not mathematically equivalent.
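The difference between the two rules is easiest to see side by side. The sketch below is a plain-Python paraphrase of both, written for illustration; the multiplicative branch reflects the sign-dependent behaviour of the Transformers repetition penalty as understood here and should be checked against the library version in use.

```python
def additive_penalty(logit, count, frequency_penalty):
    """OpenAI-style rule: subtract an amount that grows with the count."""
    return logit - frequency_penalty * count

def multiplicative_penalty(logit, seen, repetition_penalty):
    """HuggingFace-style rule: scale the logit once if the token was seen.

    Positive logits are divided and negative logits are multiplied,
    so a penalty above 1 always lowers the score.
    """
    if not seen:
        return logit
    if logit > 0:
        return logit / repetition_penalty
    return logit * repetition_penalty

for logit in (6.0, -6.0):
    print(
        logit,
        additive_penalty(logit, count=3, frequency_penalty=0.2),
        multiplicative_penalty(logit, seen=True, repetition_penalty=1.2),
    )
```

The additive rule shifts every logit by the same amount for a given count, whereas the multiplicative rule shifts large-magnitude logits more and ignores how many times the token appeared.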
Misconception: raising frequency_penalty alone solves "verbose" outputs. The penalty only discourages reuse of the same tokens; verbosity usually stems from prompt design, so explicit prompt instructions about length and style are usually more effective.
presence_penalty
Role
Applies a fixed one‑time penalty to any token that has appeared at least once, encouraging the model to explore new vocabulary rather than merely suppressing repeats.
Principle
The adjustment is:
adjusted_logit = logit - presence_penalty * [token has appeared]
where the indicator is 1 if the token has appeared at least once and 0 otherwise, regardless of how many times it appeared.
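A minimal sketch combining both penalties is shown below, following the two formulas given in this article. The function name, the list-index token ids, and the example values are assumptions made for illustration.

```python
from collections import Counter

def apply_penalties(logits, generated_token_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Apply the count-scaled and the one-time penalty to every logit."""
    counts = Counter(generated_token_ids)
    adjusted = []
    for token_id, logit in enumerate(logits):
        count = counts[token_id]
        seen = 1.0 if count > 0 else 0.0  # indicator: has the token appeared?
        adjusted.append(logit - frequency_penalty * count - presence_penalty * seen)
    return adjusted

logits = [5.0, 5.0, 5.0]
history = [0, 0, 0, 1]  # token 0 three times, token 1 once, token 2 never

penalized = apply_penalties(logits, history, presence_penalty=0.5)
rewarded = apply_penalties(logits, history, presence_penalty=-0.5)

print(penalized)  # [4.5, 4.5, 5.0]
print(rewarded)   # [5.5, 5.5, 5.0]
```

Tokens 0 and 1 receive the same adjustment even though token 0 appeared three times, and a negative value flips the adjustment into a bonus for tokens already in the output.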
Limitations
Like frequency_penalty, it indiscriminately penalizes functional words and necessary terms, and because the penalty does not accumulate, it is weaker than frequency_penalty for high‑frequency repeats.
Common Misconceptions
Misconception: frequency_penalty and presence_penalty are interchangeable. The former scales with how often a token has been used, so it targets heavy repetition; the latter applies the same penalty after a single occurrence, so it nudges the model toward new vocabulary and topics.
Misconception: setting presence_penalty to a negative value is useless. Negative values actually reward previously seen tokens, which can be valuable for maintaining strict format, fixed JSON keys, or consistent terminology in certain applications.
max_tokens
Role
Sets an upper bound on the number of tokens generated in a single request, controlling output length and inference cost.
Principle
A counter increments with each generated token; generation stops immediately when the max_tokens limit is reached, regardless of sentence or logical completeness. The limit does not affect probability calculations.
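The stopping logic can be sketched as a simple loop. Everything here is illustrative: `next_token_fn` stands in for a real model call, the scripted tokens are invented, and the `"length"` and `"stop"` labels are just names chosen for this sketch to distinguish the two ways the loop can end.

```python
def generate(next_token_fn, max_tokens, eos_token="<eos>"):
    """Generate until the model emits eos_token or max_tokens is reached."""
    output = []
    finish_reason = "length"  # assume truncation unless eos is seen
    while len(output) < max_tokens:
        token = next_token_fn(output)
        if token == eos_token:
            finish_reason = "stop"
            break
        output.append(token)
    return output, finish_reason

# Stand-in "model" that replays a fixed script (hypothetical tokens)
script = ["{", '"answer"', ":", "42", "}", "<eos>"]

def scripted_model(tokens_so_far):
    return script[len(tokens_so_far)]

truncated, reason_short = generate(scripted_model, max_tokens=3)
complete, reason_long = generate(scripted_model, max_tokens=10)

print(truncated, reason_short)  # cut off mid-object
print(complete, reason_long)    # stopped by the model itself
```

With a limit of 3 the JSON object is left unclosed, and with a limit of 10 the output is only 5 tokens long because the model ended it first.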
Limitations
max_tokens cannot guarantee semantic completeness; truncation may cut off sentences or leave structures (e.g., JSON objects) unfinished. For "concise but complete" answers, explicit prompt instructions are preferable.
Common Misconceptions
Misconception: setting max_tokens to N guarantees an output of length N. It is an upper bound; the model may stop earlier upon emitting an <eos> token.
Misconception: token count directly maps to character count. No fixed ratio exists and the figures depend on the tokenizer; as a rough guide, 1 English token ≈ 4 characters (≈1.3 tokens per word), while each Chinese character typically maps to 1–2 tokens. Code, formulas, and symbols have irregular tokenization.
When configuring max_tokens, leave a 20–30% safety margin and verify actual token usage with a tokenizer tool.
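The margin arithmetic can be written down as a small helper. Both functions below are hypothetical conveniences for this sketch: the character-based estimate uses the rough 4-characters-per-token figure quoted above and is no substitute for counting with the model's real tokenizer.

```python
import math

def estimate_english_tokens(text, chars_per_token=4.0):
    """Very rough estimate for English text; verify with a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)

def max_tokens_with_margin(expected_tokens, margin=0.25):
    """Add a safety margin (25% by default) on top of the expected length."""
    return math.ceil(expected_tokens * (1.0 + margin))

expected = 400                             # expected answer length in tokens
budget = max_tokens_with_margin(expected)  # 400 * 1.25 = 500

print(budget)
print(estimate_english_tokens("abcdefgh"))  # 8 characters -> 2 tokens
```

Rounding up keeps the budget on the safe side, since an answer that is one token short is still a truncated answer.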
Conclusion
Different frameworks may apply these parameters in varying orders, so identical settings can produce different outputs across inference engines. Understanding each parameter’s mathematical effect, limitations, and common pitfalls enables more predictable and effective LLM generation.
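The effect of ordering can be shown with a toy example. The sketch below compares two hypothetical orderings of temperature and top_p on the same made-up logits; it does not describe the order used by any specific engine.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_indices(probs, p):
    """Indices of the smallest top set whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

logits = [3.0, 2.0, 1.0, 0.0, -1.0]  # illustrative values only
temperature, top_p = 2.0, 0.9

# Order A: temperature first, then top_p on the reshaped distribution
kept_a = nucleus_indices(softmax(logits, temperature), top_p)

# Order B: top_p on the unscaled distribution, then temperature on the survivors
kept_b = nucleus_indices(softmax(logits), top_p)
probs_b = softmax([logits[i] for i in kept_b], temperature)

print(kept_a)  # candidate token ids under order A
print(kept_b)  # candidate token ids under order B
```

With identical settings, order A leaves four candidate tokens and order B leaves three, so the two pipelines sample from different distributions even before any randomness is involved.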
AI Engineer Programming