Boost LLM Originality: Master Temperature Scaling & Top‑K Sampling
This tutorial revisits a simple text‑generation function, explains how temperature scaling and top‑K sampling reshape token probability distributions, demonstrates their effects with PyTorch code and visualizations, and shows how to integrate both techniques into an improved generation routine for more diverse and human‑like outputs.
In the previous article we implemented generate_text_simple to generate text. Here we revisit that function and show its output.
model.to("cpu")
model.eval()
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=25,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:
", token_ids_to_text(token_ids, tokenizer))Output result:
We will improve the model's output quality to make it more human‑like.
How to make large models generate more original text
Temperature scaling
Background: language model output
An LLM predicts the next token by outputting a logits vector of size V (the vocabulary size); softmax converts these logits into a probability distribution over the vocabulary.
Introducing the temperature parameter T
When T = 1: standard softmax.
When T < 1: distribution becomes sharper (higher probability for top tokens, model more conservative).
When T > 1: distribution becomes smoother (more diversity, lower confidence).
Temperature scaling simply divides the logits by T before applying softmax.
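Concretely, for a logits vector z and temperature T, the probability of token i becomes p_i = exp(z_i / T) / Σ_j exp(z_j / T); dividing the logits by T is the only change relative to the standard softmax.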
During training for knowledge distillation, a high temperature smooths the teacher's distribution to help the student model learn from it.
In inference, adjusting temperature controls determinism versus diversity of generated text.
Explanation with code
We prepare a small vocabulary of nine tokens for easy tracing.
vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
Create an inverse mapping to convert token IDs back to words.
inverse_vocab = {v: k for k, v in vocab.items()}
Assume the initial context is "Every effort moves you" and that the model produces the following logits for the next token.
next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])
The result is "forward". Drawing a sample with torch.multinomial from the same probabilities also yields "forward", because it has the highest probability.
torch.multinomial samples according to the token probabilities, so other tokens can occasionally be chosen. Repeating the sampling 1,000 times shows that "forward" appears most often, while "closer", "inches", and "toward" appear less frequently.
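A minimal sketch of that experiment, reusing the probas, vocab, and inverse_vocab defined above (the exact counts depend on the random seed):
torch.manual_seed(123)
# Draw 1,000 samples from the softmax distribution and count
# how often each token ID is selected.
sample_ids = [torch.multinomial(probas, num_samples=1).item() for _ in range(1_000)]
counts = torch.bincount(torch.tensor(sample_ids), minlength=len(vocab))
for token_id, count in enumerate(counts):
    print(f"{count.item():4d} x {inverse_vocab[token_id]}")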
By adjusting temperature we can control the distribution. The function below applies temperature scaling.
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

temperatures = [1, 0.1, 5]  # original, higher confidence, lower confidence
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]
When the temperature is very low (0.1), the distribution becomes sharp and multinomial sampling almost always picks the most likely token ("forward"). When the temperature is high (5), the distribution flattens, increasing diversity but also the risk of nonsensical tokens such as "pizza".
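To see this effect directly, here is a small sketch that prints each token's probability under the three temperature settings (using the scaled_probas list from above; a bar chart of the same values works just as well):
# Print the probability of every token under each temperature setting.
for T, probs in zip(temperatures, scaled_probas):
    print(f"\nTemperature = {T}")
    for token_id, p in enumerate(probs):
        print(f"  {inverse_vocab[token_id]:>8}: {p.item():.4f}")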
Top‑K sampling
Background: training vs. inference
During training the model learns token distributions via maximum likelihood (cross‑entropy) with teacher forcing. At inference we must sample a token from the predicted distribution using strategies like Top‑K, Top‑P (nucleus), and temperature scaling.
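As a rough sketch of that training-time objective (teacher forcing with a cross-entropy loss; the tensor shapes below are illustrative assumptions, not taken from this article):
import torch
import torch.nn.functional as F

# Toy example: batch of 2 sequences, 5 positions each, vocabulary of 9 tokens.
logits = torch.randn(2, 5, 9)          # model predictions at every position
targets = torch.randint(0, 9, (2, 5))  # ground-truth next tokens (teacher forcing)

# Maximum likelihood training minimizes the cross-entropy between the two.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss)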
How Top‑K works
Sort probabilities descending.
Keep only the top K most likely tokens.
Renormalize their probabilities.
Sample one token from this reduced set.
This avoids extremely low‑probability tokens while preserving some diversity.
Explanation with code
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)
new_logits = torch.where(
    next_token_logits < top_logits[-1],
    torch.tensor(float("-inf")),
    next_token_logits
)
topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)
The resulting probabilities contain three non-zero values, from which we can apply temperature scaling and multinomial sampling.
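Putting the two techniques together on this toy example (a sketch that reuses new_logits and the temperature idea from above):
torch.manual_seed(123)
# Temperature-scale the Top-K-filtered logits, then sample probabilistically.
temperature = 1.4
topk_temp_probas = torch.softmax(new_logits / temperature, dim=0)
next_token_id = torch.multinomial(topk_temp_probas, num_samples=1).item()
print(inverse_vocab[next_token_id])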
Updating the text generation function
We combine temperature scaling and Top‑K into the generate function.
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        # Crop the running context to the model's supported context size
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]  # keep only the logits for the last position
        if top_k is not None:
            # Top-K filtering: mask every logit below the K-th largest with -inf
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val,
                torch.tensor(float("-inf")).to(logits.device),
                logits
            )
        if temperature > 0.0:
            # Temperature scaling followed by probabilistic sampling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            # Fall back to greedy decoding when temperature is 0
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if idx_next == eos_id:
            break  # stop early if the end-of-sequence token is generated
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
Running this updated function with top_k=25 and temperature=1.4 yields more varied output than the original greedy generation.
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=15,
context_size=GPT_CONFIG_124M["context_length"],
top_k=25,
temperature=1.4
)
print("Output text:
", token_ids_to_text(token_ids, tokenizer))The result demonstrates that different sampling strategies affect the large model's output, producing more diverse and potentially more original sentences.