How Large Language Models Acquire Tool‑Calling Ability: SFT, RLHF & LoRA Explained
The article explains why pretrained LLMs cannot call tools, then walks through the training pipeline of Supervised Fine-Tuning, Reinforcement Learning from Human Feedback, and knowledge distillation, showing how each stage teaches models to read tool schemas, decide when to invoke a tool, and generate JSON calls, and how LoRA and distillation make the capability cheap to train and to transfer to smaller models.
Problem Origin
During pre‑training the model only predicts the next token from pure text. It never sees a pattern like “output a JSON to trigger an API call”, so it cannot generate tool‑call JSON.
Analogy: reading cookbooks without ever stirring a pan.
Three‑Stage Training Logic
The ability is built in three stages:
SFT (Supervised Fine‑Tuning) – learn the full tool‑calling chain (read description, decide to call, output JSON, integrate result).
RLHF (Reinforcement Learning from Human Feedback) – learn when calling is appropriate by assigning reward scores.
Distillation – transfer the learned behavior from a large teacher model to a smaller student model.
SFT: From Text Prediction to Logical Fill‑in
Training samples have five parts:
Part 1: System message – list of tools with name, description, JSON schema
Part 2: User question (e.g., “Check if it will rain in Shenzhen tomorrow”)
Part 3: Model‑generated tool‑call JSON
Part 4: Simulated tool execution result
Part 5: Final natural-language answer
Hundreds of thousands of such examples raise the probability of emitting {"tool_calls": …} when the system prompt contains a tool schema.
Key point: The model treats the JSON schema as a fill‑in template and the user query as the stem.
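To make the five-part structure concrete, here is a minimal sketch of one such SFT sample in chat-message form; the tool name, schema fields, and wording are illustrative assumptions, not taken from a specific dataset.

# One SFT training sample expressed as a list of chat messages (illustrative)
sft_sample = [
    # Part 1: system message listing the available tools and their JSON schema
    {"role": "system", "content": "You may call the following tools.",
     "tools": [{
         "name": "get_weather",
         "description": "Query the weather forecast for a city",
         "parameters": {
             "type": "object",
             "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
             "required": ["city", "date"]}}]},
    # Part 2: user question
    {"role": "user", "content": "Check if it will rain in Shenzhen tomorrow"},
    # Part 3: the tool-call JSON the model must learn to emit
    {"role": "assistant", "tool_calls": [{
        "id": "call_1", "name": "get_weather",
        "arguments": "{\"city\": \"Shenzhen\", \"date\": \"tomorrow\"}"}]},
    # Part 4: simulated tool execution result
    {"role": "tool", "tool_call_id": "call_1",
     "content": "{\"condition\": \"light rain\", \"temp\": \"23-28\"}"},
    # Part 5: final natural-language answer, the supervised target
    {"role": "assistant",
     "content": "Yes, light rain is expected in Shenzhen tomorrow, 23-28°C."},
]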
Because most SFT examples are positive calls, the model tends to over‑call tools (e.g., answering “3×7” by invoking a calculator). This blind spot is corrected in the RLHF stage.
RLHF: Building “Tool‑Calling Cost” Awareness
RLHF introduces a reward signal that penalizes unnecessary tool calls.
Scenario “1+1”, model calls calculator → reward –5 (wastes tokens, slow).
Scenario “1+1”, model replies “2” → reward +5 (correct and efficient).
Scenario “What should I wear in Shenzhen tomorrow?”, model calls weather API → reward +10 (necessary and accurate).
RLHF four‑step loop:
Step 1: Sample multiple responses (direct answer, tool call, wrong parameters)
Step 2: Humans (or an AI) rank them (best → worst)
Step 3: Train a Reward Model to predict the ranking
Step 4: Optimize the main model with PPO using the reward signal
Industry trend: replace human ranking with AI feedback (RLAIF).
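As a rough sketch of Step 3, the reward model can be trained on ranked pairs with a pairwise (Bradley-Terry style) loss. The toy example below scores random stand-in embeddings with a single linear head; in a real setup the scores would come from the LLM's own representations, and the trained model then supplies the scalar reward used by PPO in Step 4.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: one linear head over 128-dim "response embeddings" (stand-ins)
reward_model = nn.Linear(128, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Each training item is a (chosen, rejected) pair taken from the human/AI ranking
chosen = torch.randn(64, 128)    # embeddings of preferred responses
rejected = torch.randn(64, 128)  # embeddings of dispreferred responses

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise loss: push the chosen response's score above the rejected one's
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()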
Runtime Mechanism: Decoupling Decision and Execution
At inference time the model only decides and emits a tool_calls JSON. The surrounding application parses the JSON, performs the HTTP request or code execution, and feeds the result back to the model.
Closed‑Loop Flow
Model interprets intent and generates tool_calls JSON.
Application parses JSON, sends the request, obtains raw tool result.
Model receives the result and produces the final natural‑language answer.
Code Illustration
# Pseudo-code for a function-calling loop
import json

# First turn: send tool definitions + user query
messages = [{"role": "user", "content": "Help me check Hangzhou weather tomorrow"}]
response = llm.chat(messages=messages, tools=weather_tool_schema)

# Model decides to call a tool
if response.finish_reason == "tool_calls":
    call = response.tool_calls[0]
    func_name = call.name                   # "get_weather"
    func_args = json.loads(call.arguments)  # {"city": "Hangzhou", "date": "tomorrow"}

    # Execute the tool in your own code
    result = execute_tool(func_name, func_args)

    # Second turn: feed the execution result back to the model
    messages.append(response.message)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = llm.chat(messages=messages, tools=weather_tool_schema)
    print(final.content)  # "Hangzhou tomorrow will be cloudy, 18-25°C, suitable for travel."

Core insight: The model never accesses the network; execute_tool performs the real work.
LoRA – Efficient Fine‑Tuning Technique
LoRA (Low-Rank Adaptation) freezes the large weight matrix W and adds a small trainable rank-decomposed update ΔW = A × B on top of it. For a 1000×1000 matrix, this cuts the trainable parameters from 1,000,000 to roughly 16,000 at rank 8 (a >60× reduction), enabling fine-tuning on a laptop GPU.
LoRA is commonly used in the SFT stage to teach tool calling, and can also be applied to reward‑model training in RLHF.
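A minimal numpy sketch of the arithmetic, assuming rank r = 8 so that the trainable parameter count matches the figures above:

import numpy as np

d, r = 1000, 8                    # hidden size and LoRA rank (r = 8 assumed)
W = np.random.randn(d, d)         # frozen pretrained weight: 1,000,000 parameters
A = np.random.randn(d, r) * 0.01  # trainable low-rank factor
B = np.zeros((r, d))              # trainable low-rank factor, initialized to zero

def lora_forward(x):
    # Original path plus the low-rank update: y = x·W + x·(A·B)
    return x @ W + x @ A @ B

x = np.random.randn(1, d)
y = lora_forward(x)               # identical to the frozen path at init, since B = 0

trainable = A.size + B.size       # 2 × 1000 × 8 = 16,000 trainable parameters
print(trainable, W.size // trainable)  # 16000 62 -> roughly a 60x reduction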
Knowledge Distillation – Transferring Ability to Small Models
Large models (e.g., GPT-4) have strong tool-calling ability but are costly to run. Distillation creates a 7B- or 13B-size student that inherits the ability.
Three Distillation Methods
Response‑based: Generate {question, teacher answer (including tool_calls JSON)} pairs with a strong model, then fine‑tune the student on these pairs. Simple but the student only learns final outputs.
Logit‑based: Align the student’s token‑level probabilities with the teacher’s (e.g., teacher P(call)=0.85 vs student P(call)=0.72). Captures “dark knowledge” but requires teacher logits.
Feature‑based: Align hidden states or attention patterns between teacher and student (e.g., teacher layer‑20 vs student layer‑10). Best performance, higher implementation complexity.
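For the logit-based variant, a common formulation (not necessarily the exact one the source has in mind) is a KL-divergence loss on temperature-softened distributions; the toy logits and temperature below are assumptions for illustration.

import torch
import torch.nn.functional as F

T = 2.0  # softening temperature (assumed)

# Toy next-token logits over a tiny vocabulary at one position
teacher_logits = torch.tensor([[2.1, 0.3, -1.0, 0.5]])
student_logits = torch.tensor([[1.2, 0.8, -0.5, 0.1]], requires_grad=True)

# Match the student's softened distribution to the teacher's ("dark knowledge")
teacher_probs = F.softmax(teacher_logits / T, dim=-1)
student_log_probs = F.log_softmax(student_logits / T, dim=-1)
distill_loss = F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (T * T)

distill_loss.backward()  # gradients pull the student toward the teacher's distribution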
Practical Distillation Pipeline
Step 1: Collect many real or synthetic user queries
Step 2: Use a teacher model (e.g., GPT‑4) to generate full answers (judgment, tool_calls JSON, final answer)
Step 3: Filter low‑quality samples (format errors, hallucinations)
Step 4: Fine‑tune the student with SFT on the cleaned data
Step 5 (optional): Apply RLHF on the student for further refinement
Result: A small model that calls tools correctly at >10× lower inference cost.
Distillation differs from plain SFT in its data source (teacher-generated vs. human-annotated) and in what is learned (the teacher's reasoning and boundary awareness rather than static patterns).
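A condensed sketch of Steps 1-4, in the same pseudo-code spirit as the earlier example; teacher_answer stands in for a real teacher-model call, and the filtering rule (valid JSON, known tool name) is one plausible quality check, not a prescribed one.

import json

KNOWN_TOOLS = {"get_weather", "calculator"}

def teacher_answer(query):
    # Step 2: stand-in for a teacher model (e.g., GPT-4) returning a full trace
    return {"tool_calls": [{"name": "get_weather",
                            "arguments": "{\"city\": \"Hangzhou\", \"date\": \"tomorrow\"}"}],
            "final_answer": "Cloudy tomorrow, 18-25°C."}

def is_valid(trace):
    # Step 3: keep only traces whose tool_calls JSON parses and names a known tool
    try:
        return all(call["name"] in KNOWN_TOOLS
                   and isinstance(json.loads(call["arguments"]), dict)
                   for call in trace.get("tool_calls", []))
    except (KeyError, TypeError, json.JSONDecodeError):
        return False

queries = ["Help me check Hangzhou weather tomorrow"]   # Step 1: collected queries
traces = [(q, teacher_answer(q)) for q in queries]       # Step 2: teacher generation
clean = [(q, t) for q, t in traces if is_valid(t)]       # Step 3: filter bad samples
# Step 4: the cleaned (query, trace) pairs become the student's SFT training data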
Complete Cognitive Chain
Training chain:
Pre‑training – no tool ability.
SFT – learn to output JSON.
RLHF – learn when to call.
Distillation – transfer to small model.
Runtime – model decides, application executes.
Common pitfalls:
Confusing SFT (teaches full calling pattern) with RLHF (teaches judgment).
Assuming the model executes the tool; it only emits JSON.
Unclear provenance of training data – core seed data are human‑annotated; large‑scale data come from strong‑model generation plus human filtering.
Mixing up distillation with SFT – distillation transfers teacher behavior, not just static answers.