How to Control LLM Output Using Temperature, Top‑K, and Top‑P
This article explains how the sampling parameters Temperature, Top-K, and Top-P shape the output of large language models. It compares greedy and beam search, illustrates with concrete examples how each parameter reshapes the probability distribution, and offers practical guidance on tuning these settings for different tasks.
Large language models (LLMs) generate text by predicting the next token from a probability distribution over the vocabulary. For the partial sentence “The cat sat on the…”, a typical distribution might be:
mat = 0.5, couch = 0.3, roof = 0.1, all remaining tokens = 0.1 combined
Greedy Search
Greedy search selects the highest‑probability token at each step. In the example above the model would always choose “mat”. This method is fast and deterministic but can produce repetitive or less diverse text.
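As a minimal sketch, greedy decoding over the toy distribution reduces to an argmax (token names and probabilities follow the article's running example, using the five-token breakdown from the Top-P section so the mass sums to 1):

```python
# Toy next-token distribution for "The cat sat on the ..." (article's example).
probs = {"mat": 0.5, "couch": 0.3, "roof": 0.1, "floor": 0.07, "bed": 0.03}

def greedy_pick(probs):
    """Greedy search: always take the single highest-probability token."""
    return max(probs, key=probs.get)

print(greedy_pick(probs))  # -> mat, every single time
```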
Beam Search
Beam search keeps the top k most likely partial sequences (the beam width) at each step, expands each with possible next tokens, and retains the k sequences with the highest combined scores.
With a beam width of 2 the two beams evolve as follows:
Beam 1: “The cat sat on the mat” (0.5). Next‑word candidates: “and” (0.6), “while” (0.3).
Beam 2: “The cat sat on the couch” (0.3). Next‑word candidates: “while” (0.4), “and” (0.3).
Combined probabilities:
“The cat sat on the mat and…” = 0.5 × 0.6 = 0.30
“The cat sat on the couch while…” = 0.3 × 0.4 = 0.12
Beam search selects the higher‑scoring sequence, often yielding more coherent results than greedy search.
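The two-step walk-through above can be sketched in a few lines, with the candidate probabilities taken from the example (they are hypothetical values, not real model outputs). Note that a full expansion also scores the sequence "mat while" (0.5 × 0.3 = 0.15), which outranks "couch while" for the second beam slot:

```python
import heapq

# Step-1 beams and their probabilities (from the running example).
step1 = {"mat": 0.5, "couch": 0.3}
# Hypothetical next-word candidates for each beam, as in the article.
step2 = {
    "mat": {"and": 0.6, "while": 0.3},
    "couch": {"while": 0.4, "and": 0.3},
}

def beam_search_step(step1, step2, width=2):
    """Expand every beam, score each sequence by the product of its
    token probabilities, and keep the `width` best sequences."""
    beams = heapq.nlargest(width, step1.items(), key=lambda kv: kv[1])
    expanded = [
        ((tok1, tok2), p1 * p2)
        for tok1, p1 in beams
        for tok2, p2 in step2[tok1].items()
    ]
    return heapq.nlargest(width, expanded, key=lambda kv: kv[1])

# Best sequence: ("mat", "and") with score 0.5 * 0.6 = 0.30.
print(beam_search_step(step1, step2))
```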
Sampling
Sampling draws a token according to its probability rather than always picking the top one. Using the same distribution the model might output “couch” or even “roof”, introducing variability.
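A sketch of plain sampling over the same toy distribution; `random.choices` performs the weighted draw:

```python
import random

probs = {"mat": 0.5, "couch": 0.3, "roof": 0.1, "floor": 0.07, "bed": 0.03}

def sample_token(probs, rng=random):
    """Draw one token in proportion to its probability."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

random.seed(0)
draws = [sample_token(probs) for _ in range(1000)]
# "mat" appears most often (around half the draws), but lower-probability
# tokens like "roof" and "bed" show up as well.
```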
Temperature
Temperature (T) rescales the raw logits before the softmax, controlling randomness. T = 1 leaves the distribution unchanged; values below 1 make the output more deterministic, and values above 1 make it more random. Typical settings fall between 0 and 1.
Original probabilities (T = 1): mat = 0.5, couch = 0.3, roof = 0.1
T = 0.2: mat = 0.8, couch = 0.15, roof = 0.05
T = 0.7: mat = 0.55, couch = 0.35, roof = 0.10
Lower temperatures sharpen the distribution (making high‑probability tokens even more likely); higher temperatures flatten it (giving low‑probability tokens a chance).
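Temperature rescaling can be sketched directly from the definition: divide each logit (log-probability) by T, then re-apply the softmax, which is equivalent to raising each probability to the power 1/T and renormalizing. The in-text numbers above are illustrative; this computes exact values for the toy distribution:

```python
import math

probs = {"mat": 0.5, "couch": 0.3, "roof": 0.1, "floor": 0.07, "bed": 0.03}

def apply_temperature(probs, T):
    """Divide the logits (log-probabilities) by T, then re-apply softmax."""
    scaled = {tok: math.exp(math.log(p) / T) for tok, p in probs.items()}
    z = sum(scaled.values())
    return {tok: v / z for tok, v in scaled.items()}

# Low T sharpens: "mat" climbs above 0.9.  High T flattens: it drops below 0.4.
sharp = apply_temperature(probs, 0.2)
flat = apply_temperature(probs, 2.0)
```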
Top‑P (Nucleus) Sampling
Top‑P selects the smallest set of tokens whose cumulative probability reaches or exceeds p, then samples from this subset (the "nucleus").
Example probabilities: mat = 0.5, couch = 0.3, roof = 0.1, floor = 0.07, bed = 0.03.
If p = 0.5, only “mat” is kept.
If p = 0.9, “mat”, “couch”, and “roof” remain.
If p = 1.0, all tokens are eligible (full random sampling).
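A sketch of the nucleus filter: sort tokens by probability, accumulate until the mass reaches p, and renormalize the survivors before sampling:

```python
probs = {"mat": 0.5, "couch": 0.3, "roof": 0.1, "floor": 0.07, "bed": 0.03}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, renormalized so the kept mass sums to 1."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    z = sum(kept.values())
    return {tok: v / z for tok, v in kept.items()}

print(list(top_p_filter(probs, 0.5)))  # ['mat']
print(list(top_p_filter(probs, 0.9)))  # ['mat', 'couch', 'roof']
```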
Top‑K Sampling
Top‑K limits consideration to the K highest‑probability tokens at each step, then samples from them.
K = 2 → keep “mat”(0.5) and “couch”(0.3).
K = 3 → keep “mat”(0.5), “couch”(0.3), “roof”(0.1).
K = vocabulary size → full sampling.
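Top-K is the same filtering idea with a fixed token count instead of a probability mass:

```python
probs = {"mat": 0.5, "couch": 0.3, "roof": 0.1, "floor": 0.07, "bed": 0.03}

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, renormalized."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    z = sum(kept.values())
    return {tok: v / z for tok, v in kept.items()}

# K = 2 keeps "mat" and "couch", renormalized to 0.5/0.8 = 0.625
# and 0.3/0.8 = 0.375.
k2 = top_k_filter(probs, 2)
```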
These strategies provide different trade‑offs between controllability and creativity.