Deep Dive into Core LLM API Parameters

While many newcomers think using an LLM API is as simple as picking a model and feeding a prompt, the real control lies in parameters such as temperature, top‑p, top‑k, max_tokens, penalties, stop, and stream, each of which dramatically influences output quality, length, cost, and behavior.

Qborfy AI
Qborfy AI
Qborfy AI
Deep Dive into Core LLM API Parameters

Many people who use an LLM API for the first time assume the process is trivial: choose a model, drop in a prompt, and wait for the result. In practice, the same query can sometimes produce stable, accurate answers and other times generate divergent, noisy output. The cause is often not the prompt but the API parameters.

1. Global Overview – Parameter Categories

When you break down a single LLM API call, the common parameters fall into several groups:

Category          Typical Parameters                         Purpose
--------------------------------------------------------------------------
Basic             model, messages, stream                    Select model, pass conversation, control response mode
Sampling Control  temperature, top_p, top_k, min_p           Control randomness and diversity of generation
Length Control    max_tokens, max_completion_tokens         Limit maximum output length (and cost)
Repetition Penalty presence_penalty, frequency_penalty, repetition_penalty  Reduce repeated content, encourage new topics
Stop Condition    stop, stop_sequences                       Define tokens that signal generation should end
Advanced Control  logit_bias, seed, user                     Token‑level bias, deterministic output, user ID
Tool Calling      tools, tool_choice, parallel_tool_calls   Enable function calling (see later article)
Chain‑of‑Thought  thinking, reasoning_effort                Control reasoning process (see fourth article)

The article focuses first on the most influential parameters: sampling control, output length, repetition control, and basic response behavior.

2. Temperature – How Conservative or Creative the Model Is

temperature

determines the "smoothness" of the probability distribution. Lower values (<1) make the model favor high‑probability tokens, resulting in more deterministic and focused output; higher values (>1) encourage the model to explore less probable tokens, producing more varied but potentially less logical text.

The mathematical form is: probs = softmax(logits / temperature) Typical ranges are [0.0, 2.0] with a default of 1.0. Some platforms (e.g., Anthropic Claude, Kimi) limit the range to 0 ~ 1.

When stability and reproducibility are required (e.g., code generation, rule extraction, factual QA), lower temperatures are recommended.

3. Top‑P (Nucleus Sampling) – Selecting a High‑Probability Token Set

top_p

(also called nucleus sampling) keeps only the smallest set of tokens whose cumulative probability exceeds p. Unlike temperature, which reshapes the whole distribution, top_p truncates the tail. p = 0.1: keep only the very top tokens. p = 0.5: keep tokens covering 50% of probability mass. p = 0.9: common default, balances creativity and noise. p = 1.0: no truncation.

Typical recommended values per scenario:

Scenario          Recommended Top‑P   Explanation
----------------------------------------------------------
Code completion    0.1 ~ 0.5           Tight control for correctness
Knowledge QA      0.5 ~ 0.8           Balance accuracy and completeness
Creative writing   0.9 ~ 0.95          Allow more variety
Open‑domain chat   0.9 ~ 1.0           Maximize natural diversity

4. Top‑K – Hard Limit on Candidate Count

top_k

restricts the model to the K most likely tokens at each step, regardless of their cumulative probability. It acts as a hard rule: only the top K tokens are considered.

K = 1 → greedy decoding (most deterministic, but very rigid).

K = 10~50 → common sweet spot, filters low‑probability noise while keeping flexibility.

Larger K → more candidates, more lively output.

Many implementations apply top_k first, then top_p, and finally temperature, forming three successive “gates”.

5. Platform Support for Top‑K

OpenAI’s API does **not** support top_k; Anthropic Claude, Google Gemini, DeepSeek, and others do. When building a multi‑platform SDK, you cannot assume top_k is universally available.

6. Max Tokens – Controlling Length and Cost

max_tokens

(or the newer max_completion_tokens) caps the number of tokens the model may generate. If the limit is reached, the model stops even if the answer is incomplete.

Typical pitfalls:

Too low → truncated answers, incomplete JSON, cut‑off code.

Too high → higher cost, longer latency, possible verbosity.

Recommended ranges per task:

Task                     Recommended Max Tokens   Reason
---------------------------------------------------------------
Short QA / classification   50 ~ 200               Concise answers
Summarization               200 ~ 500              Preserve core information
Code generation              500 ~ 2000            Function‑level or snippet‑level code
Article writing               1000 ~ 4000           Full paragraphs
Long‑form reasoning          4000+                 Need large context (e.g., DeepSeek‑R1 ≥4096)

7. Presence Penalty & Frequency Penalty – Reducing Repetition

Both parameters aim to curb repetition but work differently: presence_penalty penalizes any token that has already appeared, encouraging the model to introduce new topics. frequency_penalty penalizes tokens proportionally to how many times they have appeared, suppressing repeated phrases.

Typical ranges:

Scenario          Presence Penalty   Frequency Penalty   Explanation
--------------------------------------------------------------------------
Fact QA            0.0                0.0                 No penalty needed
Creative writing   0.5 ~ 1.0          0.3 ~ 0.7           Encourage novelty, reduce echo
Long document      0.3 ~ 0.8          0.5 ~ 1.0           Avoid homogenization
Code generation    0.0                0.0 ~ 0.2           Repetition often required in code

Rule of thumb: use presence_penalty when you want the model to avoid getting stuck on a single point, and frequency_penalty when you want to prevent the same sentence from being repeated.

8. Stop – Explicit End Tokens

stop

tells the model to halt generation as soon as any of the specified strings appears. It is often underestimated but can solve over‑generation, structural drift, or the model “speaking for the user”.

Typical usage examples:

{
  "stop": ["
", "。", "Human:", "END"]
}

Single‑line answer: ["\n"] Structured output: use a delimiter as the stop token

Dialogue systems: ["Human:", "User:"] to prevent the model from generating the next user turn

9. Stream – Real‑time vs. Batch Output

stream

is a boolean (default false). When true, the API returns partial results as they are generated, improving perceived responsiveness for chat, Copilot, or Q&A assistants. When false, the full response is returned at once, which is simpler for batch jobs or offline summarization.

10. Other Useful Parameters

n

: number of distinct completions to return (default 1, up to 5+). logit_bias: add a bias to specific token logits, e.g., {"2435": -100} to strongly forbid a token. seed: integer seed for reproducible output. user: user identifier for abuse monitoring. response_format: force JSON output, e.g., {"type": "json_object"}.

11. Decision Framework – From Task to Parameters

Start
├── What kind of task?
│   ├── Precise (code / math / factual QA)
│   │   ├── temperature: 0.0~0.3
│   │   ├── top_p: 0.5~0.8
│   │   ├── top_k: 10~30
│   │   └── penalty: 0.0
│   ├── Balanced (daily chat / translation / summarization)
│   │   ├── temperature: 0.5~0.8
│   │   ├── top_p: 0.8~0.95
│   │   ├── top_k: 30~50
│   │   └── penalty: 0.0~0.3
│   └── Creative (creative writing / brainstorming / poetry)
│       ├── temperature: 0.8~1.5
│       ├── top_p: 0.9~1.0
│       ├── top_k: 50+
│       └── penalty: 0.3~1.0
├── How long should the output be?
│   ├── Short answer → max_tokens: 50~200
│   ├── Medium length → max_tokens: 200~1000
│   └── Long text → max_tokens: 1000+
├── Does the user need instant feedback?
│   ├── Yes → stream: true
│   └── No → stream: false
└── Structured output required?
    ├── Yes → response_format: {"type": "json_object"}
    └── No → default text

12. Final Thoughts

The core message is that LLM API parameters are not random switches; they define the model’s operational boundaries. When you encounter unstable answers, length overruns, repetitive text, or style swings, the culprit is often a mis‑tuned parameter rather than the prompt itself.

This article lays the foundation. The next piece will compare how different providers (OpenAI, Anthropic, Gemini, DeepSeek, Kimi, MiniMax, Zero‑One Wanwu) implement these parameters and discuss cross‑language SDK adaptation.

Series Navigation

Article 1 (this one) : Core parameter deep dive

Article 2 : Cross‑platform API parameter comparison and multilingual adapter framework

Article 3 : Sampling parameter tuning best practices for four major scenarios

Article 4 : Chain‑of‑Thought and reasoning control parameters

Article 5 : Function calling and tool‑use parameters in practice

Article 6 : Multimodal parameters and output format control

Article 7 : Streaming responses and performance optimization

Article 8 : LLM API gateway and aggregation architecture

In the next article we will examine platform‑specific support for temperature, max_tokens, and penalty fields, a crucial step for building robust multi‑platform adapters.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

LLMPrompt EngineeringAPIsamplingparameter tuningtop_ptemperaturemax_tokens
Qborfy AI
Written by

Qborfy AI

A knowledge base that logs daily experiences and learning journeys, sharing them with you to grow together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.