Boost LLM Originality: Master Temperature Scaling & Top‑K Sampling

This tutorial revisits a simple text‑generation function, explains how temperature scaling and top‑K sampling reshape token probability distributions, demonstrates their effects with PyTorch code and visualizations, and shows how to integrate both techniques into an improved generation routine for more diverse and human‑like outputs.

Instant Consumer Technology Team

In the previous article we implemented generate_text_simple to generate text. Here we revisit that function and show its output.

import tiktoken
import torch

# model, generate_text_simple, text_to_token_ids, token_ids_to_text,
# and GPT_CONFIG_124M were defined in the previous article
model.to("cpu")
model.eval()

tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


We will improve the model's output quality to make it more human‑like.

How to make large models generate more original text

Temperature scaling

Background: language model output

An LLM predicts the next token by outputting a logits vector of size V (the vocabulary size). Softmax converts these logits into a probability distribution over the vocabulary.

Introducing temperature parameter T

When T = 1: standard softmax.

When T < 1: distribution becomes sharper (higher probability for top tokens, model more conservative).

When T > 1: distribution becomes smoother (more diversity, lower confidence).

Temperature scaling divides the logits by a temperature T before applying softmax.

In training (distillation) a high temperature smooths the distribution to help a student model learn.

In inference, adjusting temperature controls determinism versus diversity of generated text.

Explanation with code

We prepare a small vocabulary of nine tokens for easy tracing.

vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}

Create an inverse mapping to convert token IDs back to words.

inverse_vocab = {v: k for k, v in vocab.items()}

Assume the initial context "Every effort moves you" and generate logits for the next token.

import torch

# Hypothetical logits the model might produce for the next token
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])

The result is "forward". Sampling with torch.multinomial from the same probabilities (with a fixed seed) also happens to select "forward", since it has by far the highest probability.

Multinomial samples according to token probabilities, so other tokens can be chosen occasionally. Repeating sampling 1000 times shows "forward" appears most often, while "closer", "inches", and "toward" appear less frequently.
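The repeated-sampling experiment can be sketched as follows (the 1000-draw loop and the frequency printout are scaffolding added here, not code from the original article):

```python
import torch

torch.manual_seed(123)

vocab = {
    "closer": 0, "every": 1, "effort": 2, "forward": 3, "inches": 4,
    "moves": 5, "pizza": 6, "toward": 7, "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
probas = torch.softmax(next_token_logits, dim=0)

# Draw 1000 samples and count how often each token is chosen
samples = [torch.multinomial(probas, num_samples=1).item() for _ in range(1000)]
freqs = torch.bincount(torch.tensor(samples), minlength=len(vocab))
for token_id, freq in enumerate(freqs):
    print(f"{freq.item():4d} x {inverse_vocab[token_id]}")
```

With this seed, "forward" dominates the counts, while "toward" and "closer" appear a substantial minority of the time.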

By adjusting temperature we can control the distribution. The function below applies temperature scaling.

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

temperatures = [1, 0.1, 5]  # original, higher confidence, lower confidence
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

When temperature is very low (0.1) the distribution becomes sharp and multinomial almost always picks the most likely token ("forward"). When temperature is high (5) the distribution flattens, increasing diversity but also the risk of nonsensical tokens such as "pizza".
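To make this concrete, the self-contained sketch below (reusing the article's logits and softmax_with_temperature) prints the probability of the most likely token, "forward", and of the implausible token "pizza" at each temperature:

```python
import torch

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before softmax; T < 1 sharpens, T > 1 flattens
    return torch.softmax(logits / temperature, dim=0)

# "forward" is index 3, "pizza" is index 6 in the article's toy vocabulary
for T in [1, 0.1, 5]:
    probas = softmax_with_temperature(next_token_logits, T)
    print(f"T={T}: P(forward)={probas[3]:.4f}, P(pizza)={probas[6]:.6f}")
```

At T = 0.1 almost all probability mass sits on "forward", while at T = 5 the mass assigned to "pizza" grows by orders of magnitude.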

Top‑K sampling

Background: training vs. inference

During training the model learns token distributions via maximum likelihood (cross‑entropy) with teacher forcing. At inference we must sample a token from the predicted distribution using strategies like Top‑K, Top‑P (nucleus), and temperature scaling.

How Top‑K works

Sort probabilities descending.

Keep only the top K most likely tokens.

Renormalize their probabilities.

Sample one token from this reduced set.

This avoids extremely low‑probability tokens while preserving some diversity.
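The four steps above can be sketched directly with torch.topk, which returns the K largest logits already sorted in descending order (this renormalize-over-K variant is a sketch added here; the article's own code below uses an equivalent -inf masking trick instead):

```python
import torch

torch.manual_seed(123)

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

k = 3
# Steps 1-2: keep only the K most likely tokens (sorted descending)
top_logits, top_pos = torch.topk(next_token_logits, k)
# Step 3: renormalize by taking softmax over just the kept logits
top_probas = torch.softmax(top_logits, dim=0)
# Step 4: sample from the reduced set, then map back to the full vocabulary
sampled = torch.multinomial(top_probas, num_samples=1)
next_token_id = top_pos[sampled].item()
print(next_token_id)
```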

Explanation with code

top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)

# Mask every logit below the K-th largest with -inf, so softmax
# assigns those tokens exactly zero probability
new_logits = torch.where(
    next_token_logits < top_logits[-1],
    torch.tensor(float("-inf")),
    next_token_logits
)

topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)

The resulting probabilities contain three non‑zero values, from which we can apply temperature scaling and multinomial sampling.
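Combining the two techniques on this toy example looks like the following self-contained sketch (the temperature value 1.4 is illustrative, chosen to match the full generation call later in the article):

```python
import torch

torch.manual_seed(123)

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

# Top-K masking: everything below the 3rd-largest logit becomes -inf
top_logits, _ = torch.topk(next_token_logits, 3)
new_logits = torch.where(
    next_token_logits < top_logits[-1],
    torch.tensor(float("-inf")),
    next_token_logits,
)

# Temperature-scale the masked logits, then sample one token
temperature = 1.4
topk_probas = torch.softmax(new_logits / temperature, dim=0)
next_token_id = torch.multinomial(topk_probas, num_samples=1).item()
print(next_token_id)
```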

Updating the text generation function

We combine temperature scaling and Top‑K into the generate function.

def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        # Crop the context to the model's supported window
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        # Focus on the last position only
        logits = logits[:, -1, :]

        # Top-K filtering: mask everything below the K-th largest logit
        if top_k is not None:
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val,
                torch.tensor(float("-inf")).to(logits.device),
                logits
            )

        # Temperature scaling plus sampling; temperature == 0.0 falls back to greedy decoding
        if temperature > 0.0:
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)

        # Stop early if the end-of-sequence token is produced
        if eos_id is not None and idx_next.item() == eos_id:
            break
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

Running this updated function with top_k=25 and temperature=1.4 yields more varied output compared to the original greedy generation.

torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

The result demonstrates that different sampling strategies affect the large model's output, producing more diverse and potentially more original sentences.

Tags: LLM, PyTorch, text generation, temperature scaling, top-k sampling