Deep Dive into Core LLM API Parameters
While many newcomers think using an LLM API is as simple as picking a model and feeding a prompt, the real control lies in parameters such as temperature, top‑p, top‑k, max_tokens, penalties, stop, and stream, each of which dramatically influences output quality, length, cost, and behavior.
Many people who use an LLM API for the first time assume the process is trivial: choose a model, drop in a prompt, and wait for the result. In practice, the same query can sometimes produce stable, accurate answers and other times generate divergent, noisy output. The cause is often not the prompt but the API parameters.
1. Global Overview – Parameter Categories
When you break down a single LLM API call, the common parameters fall into several groups:
Category Typical Parameters Purpose
--------------------------------------------------------------------------
Basic model, messages, stream Select model, pass conversation, control response mode
Sampling Control temperature, top_p, top_k, min_p Control randomness and diversity of generation
Length Control max_tokens, max_completion_tokens Limit maximum output length (and cost)
Repetition Penalty presence_penalty, frequency_penalty, repetition_penalty Reduce repeated content, encourage new topics
Stop Condition stop, stop_sequences Define tokens that signal generation should end
Advanced Control logit_bias, seed, user Token‑level bias, deterministic output, user ID
Tool Calling tools, tool_choice, parallel_tool_calls Enable function calling (see later article)
Chain‑of‑Thought thinking, reasoning_effort Control reasoning process (see fourth article)The article focuses first on the most influential parameters: sampling control, output length, repetition control, and basic response behavior.
2. Temperature – How Conservative or Creative the Model Is
temperaturedetermines the "smoothness" of the probability distribution. Lower values (<1) make the model favor high‑probability tokens, resulting in more deterministic and focused output; higher values (>1) encourage the model to explore less probable tokens, producing more varied but potentially less logical text.
The mathematical form is: probs = softmax(logits / temperature) Typical ranges are [0.0, 2.0] with a default of 1.0. Some platforms (e.g., Anthropic Claude, Kimi) limit the range to 0 ~ 1.
When stability and reproducibility are required (e.g., code generation, rule extraction, factual QA), lower temperatures are recommended.
3. Top‑P (Nucleus Sampling) – Selecting a High‑Probability Token Set
top_p(also called nucleus sampling) keeps only the smallest set of tokens whose cumulative probability exceeds p. Unlike temperature, which reshapes the whole distribution, top_p truncates the tail. p = 0.1: keep only the very top tokens. p = 0.5: keep tokens covering 50% of probability mass. p = 0.9: common default, balances creativity and noise. p = 1.0: no truncation.
Typical recommended values per scenario:
Scenario Recommended Top‑P Explanation
----------------------------------------------------------
Code completion 0.1 ~ 0.5 Tight control for correctness
Knowledge QA 0.5 ~ 0.8 Balance accuracy and completeness
Creative writing 0.9 ~ 0.95 Allow more variety
Open‑domain chat 0.9 ~ 1.0 Maximize natural diversity4. Top‑K – Hard Limit on Candidate Count
top_krestricts the model to the K most likely tokens at each step, regardless of their cumulative probability. It acts as a hard rule: only the top K tokens are considered.
K = 1 → greedy decoding (most deterministic, but very rigid).
K = 10~50 → common sweet spot, filters low‑probability noise while keeping flexibility.
Larger K → more candidates, more lively output.
Many implementations apply top_k first, then top_p, and finally temperature, forming three successive “gates”.
5. Platform Support for Top‑K
OpenAI’s API does **not** support top_k; Anthropic Claude, Google Gemini, DeepSeek, and others do. When building a multi‑platform SDK, you cannot assume top_k is universally available.
6. Max Tokens – Controlling Length and Cost
max_tokens(or the newer max_completion_tokens) caps the number of tokens the model may generate. If the limit is reached, the model stops even if the answer is incomplete.
Typical pitfalls:
Too low → truncated answers, incomplete JSON, cut‑off code.
Too high → higher cost, longer latency, possible verbosity.
Recommended ranges per task:
Task Recommended Max Tokens Reason
---------------------------------------------------------------
Short QA / classification 50 ~ 200 Concise answers
Summarization 200 ~ 500 Preserve core information
Code generation 500 ~ 2000 Function‑level or snippet‑level code
Article writing 1000 ~ 4000 Full paragraphs
Long‑form reasoning 4000+ Need large context (e.g., DeepSeek‑R1 ≥4096)7. Presence Penalty & Frequency Penalty – Reducing Repetition
Both parameters aim to curb repetition but work differently: presence_penalty penalizes any token that has already appeared, encouraging the model to introduce new topics. frequency_penalty penalizes tokens proportionally to how many times they have appeared, suppressing repeated phrases.
Typical ranges:
Scenario Presence Penalty Frequency Penalty Explanation
--------------------------------------------------------------------------
Fact QA 0.0 0.0 No penalty needed
Creative writing 0.5 ~ 1.0 0.3 ~ 0.7 Encourage novelty, reduce echo
Long document 0.3 ~ 0.8 0.5 ~ 1.0 Avoid homogenization
Code generation 0.0 0.0 ~ 0.2 Repetition often required in codeRule of thumb: use presence_penalty when you want the model to avoid getting stuck on a single point, and frequency_penalty when you want to prevent the same sentence from being repeated.
8. Stop – Explicit End Tokens
stoptells the model to halt generation as soon as any of the specified strings appears. It is often underestimated but can solve over‑generation, structural drift, or the model “speaking for the user”.
Typical usage examples:
{
"stop": ["
", "。", "Human:", "END"]
}Single‑line answer: ["\n"] Structured output: use a delimiter as the stop token
Dialogue systems: ["Human:", "User:"] to prevent the model from generating the next user turn
9. Stream – Real‑time vs. Batch Output
streamis a boolean (default false). When true, the API returns partial results as they are generated, improving perceived responsiveness for chat, Copilot, or Q&A assistants. When false, the full response is returned at once, which is simpler for batch jobs or offline summarization.
10. Other Useful Parameters
n: number of distinct completions to return (default 1, up to 5+). logit_bias: add a bias to specific token logits, e.g., {"2435": -100} to strongly forbid a token. seed: integer seed for reproducible output. user: user identifier for abuse monitoring. response_format: force JSON output, e.g., {"type": "json_object"}.
11. Decision Framework – From Task to Parameters
Start
├── What kind of task?
│ ├── Precise (code / math / factual QA)
│ │ ├── temperature: 0.0~0.3
│ │ ├── top_p: 0.5~0.8
│ │ ├── top_k: 10~30
│ │ └── penalty: 0.0
│ ├── Balanced (daily chat / translation / summarization)
│ │ ├── temperature: 0.5~0.8
│ │ ├── top_p: 0.8~0.95
│ │ ├── top_k: 30~50
│ │ └── penalty: 0.0~0.3
│ └── Creative (creative writing / brainstorming / poetry)
│ ├── temperature: 0.8~1.5
│ ├── top_p: 0.9~1.0
│ ├── top_k: 50+
│ └── penalty: 0.3~1.0
├── How long should the output be?
│ ├── Short answer → max_tokens: 50~200
│ ├── Medium length → max_tokens: 200~1000
│ └── Long text → max_tokens: 1000+
├── Does the user need instant feedback?
│ ├── Yes → stream: true
│ └── No → stream: false
└── Structured output required?
├── Yes → response_format: {"type": "json_object"}
└── No → default text12. Final Thoughts
The core message is that LLM API parameters are not random switches; they define the model’s operational boundaries. When you encounter unstable answers, length overruns, repetitive text, or style swings, the culprit is often a mis‑tuned parameter rather than the prompt itself.
This article lays the foundation. The next piece will compare how different providers (OpenAI, Anthropic, Gemini, DeepSeek, Kimi, MiniMax, Zero‑One Wanwu) implement these parameters and discuss cross‑language SDK adaptation.
Series Navigation
Article 1 (this one) : Core parameter deep dive
Article 2 : Cross‑platform API parameter comparison and multilingual adapter framework
Article 3 : Sampling parameter tuning best practices for four major scenarios
Article 4 : Chain‑of‑Thought and reasoning control parameters
Article 5 : Function calling and tool‑use parameters in practice
Article 6 : Multimodal parameters and output format control
Article 7 : Streaming responses and performance optimization
Article 8 : LLM API gateway and aggregation architecture
In the next article we will examine platform‑specific support for temperature, max_tokens, and penalty fields, a crucial step for building robust multi‑platform adapters.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Qborfy AI
A knowledge base that logs daily experiences and learning journeys, sharing them with you to grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
