Artificial Intelligence · 9 min read

Decoding Strategies for Generative Models: Top‑k, Top‑p, Contrastive Search, Beam Search, and Sampling

The article explains how generative models use deterministic methods like greedy and beam search and stochastic techniques such as top‑k, top‑p, contrastive search and sampling, describing their mechanisms, temperature control, repetition penalties, and practical trade‑offs for balancing fluency, diversity and coherence.

Baidu Geek Talk
Generative models use two main categories of decoding methods: deterministic (e.g., greedy search and beam search) and stochastic (e.g., sampling, top‑k, top‑p, contrastive search). Deterministic methods often produce less natural text, while stochastic methods introduce randomness to improve diversity and fluency.

Top‑k sampling: At each decoding step the model keeps only the k highest‑probability tokens, renormalizes their probabilities, and samples the next token from that set.
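
As an illustrative sketch (not the transformers implementation), top‑k filtering can be written in a few lines of NumPy; the function name `top_k_sample` is invented for this example:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    # Keep the k highest-logit tokens, renormalize, and sample one of them.
    # Illustrative sketch, not the transformers implementation.
    logits = np.asarray(logits, dtype=float)
    top = np.argpartition(logits, -k)[-k:]           # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # softmax over just those k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
token = top_k_sample([2.0, 1.0, 0.1, -1.0], k=2, rng=rng)
# With k=2, only the two highest-logit tokens (indices 0 and 1) can ever be chosen.
```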

Top‑p (nucleus) sampling: The model sorts tokens by descending probability, accumulates them until the cumulative probability exceeds a threshold p, and then samples from this dynamically sized candidate set.
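
A minimal sketch of the same idea, again with invented names rather than library code: sort by probability, cut off at cumulative mass p, and sample from the renormalized nucleus.

```python
import numpy as np

def top_p_sample(logits, p, rng):
    # Nucleus sampling sketch: sample from the smallest set of tokens whose
    # cumulative probability exceeds p. Illustrative only.
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # token indices, most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # smallest prefix with cum > p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

rng = np.random.default_rng(0)
token = top_p_sample([3.0, 1.0, 0.5, -2.0], p=0.9, rng=rng)
```

Unlike top‑k, the number of candidate tokens here varies from step to step with the shape of the distribution.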

Temperature controls randomness: a higher temperature yields a flatter distribution and more diverse output, while a lower temperature makes the distribution sharper and more deterministic.
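
Concretely, temperature scaling just divides the logits by the temperature before the softmax; a small sketch:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before the softmax (illustrative sketch).
    # T > 1 flattens the distribution; T < 1 sharpens it.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    return probs / probs.sum()

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # low T: more peaked
flat = softmax_with_temperature(logits, 2.0)   # high T: flatter
# The top token's probability is higher at low temperature.
```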

Contrastive search: Combines the model's confidence (the candidate token's probability) with a degeneration penalty based on the maximum cosine similarity between the candidate token's representation and those of previous context tokens. The penalty discourages repetition; when the penalty weight α is zero, contrastive search reduces to greedy decoding.

Code example for contrastive search:

output = model.generate(
    input_ids,
    penalty_alpha=0.6,  # α in contrastive search
    top_k=4,            # k in contrastive search
    max_length=512
)
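
Under the hood, the selection rule balances these two terms. A toy sketch of the score, with invented names and made‑up 2‑D vectors standing in for model hidden states:

```python
import numpy as np

def contrastive_score(prob, cand_vec, context_vecs, alpha):
    # Contrastive-search scoring sketch: model confidence minus a degeneration
    # penalty (max cosine similarity with context tokens). Illustrative only;
    # none of these names come from a real library.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    penalty = max(cos(cand_vec, c) for c in context_vecs)
    return (1 - alpha) * prob - alpha * penalty

context = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
repeat = np.array([1.0, 0.0])   # identical to a context vector
novel = np.array([1.0, -1.0])   # dissimilar to the context
# With alpha > 0, a likely-but-repetitive token can lose to a less likely novel one.
s_repeat = contrastive_score(0.9, repeat, context, alpha=0.6)
s_novel = contrastive_score(0.5, novel, context, alpha=0.6)
# With alpha = 0 the score is just the probability, i.e. greedy decoding.
```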

Beam search: Keeps the num_beams most likely partial sequences (hypotheses) at each step, expands each of them, and finally returns the highest‑probability complete sequence. It reduces the risk of missing high‑probability sequences that greedy search would discard, but can still produce repeated fragments.

An n‑gram repetition penalty can be applied to beam search to prevent duplicate n‑grams:

beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2, # prevent repeated 2‑grams
    early_stopping=True
)
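
The core loop can be sketched with a toy conditional distribution. This example (all names and probabilities are illustrative) shows beam search recovering a sequence that greedy decoding would miss:

```python
import math

def next_logprobs(prefix):
    # Toy conditional distribution over 2 tokens, depending on the prefix.
    # Purely illustrative, not a real model.
    if not prefix:
        return [math.log(0.6), math.log(0.4)]  # greedy would pick token 0
    if prefix[0] == 0:
        return [math.log(0.5), math.log(0.5)]  # best path via 0: 0.6 * 0.5 = 0.30
    return [math.log(0.9), math.log(0.1)]      # best path via 1: 0.4 * 0.9 = 0.36

def beam_search(num_steps, num_beams):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(num_steps):
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in enumerate(next_logprobs(seq))
        ]
        # Keep only the num_beams highest-scoring expansions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]

best = beam_search(num_steps=2, num_beams=2)
# Beam search finds [1, 0] (probability 0.36); greedy (num_beams=1) commits to
# token 0 first and can reach at most probability 0.30.
```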

Sampling (do_sample=True): Makes generation nondeterministic. Lowering the temperature makes the distribution sharper; as the temperature approaches 0, sampling collapses back to greedy decoding and inherits its repetition issues.

Example of activating sampling without top‑k:

sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)

Example with temperature control:

sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)

Combining top‑k and top‑p (and returning multiple sequences):

sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

The article discusses practical trade‑offs: choosing appropriate decoding methods, randomness parameters, and temperature values based on the task and desired output characteristics. It also cites research indicating that high‑quality human language does not strictly follow maximum‑probability rules, highlighting the importance of incorporating randomness and creativity into generation.

Tags: AI, sampling, beam search, text generation, contrastive search, decoding methods, top-k, top-p
Written by Baidu Geek Talk

Follow us to discover more Baidu tech insights.