How to Explain a Jump from 71% to 94% Tool‑Calling Accuracy in a JD Interview
The article walks through a JD interview scenario where a candidate explains how a tool‑calling accuracy metric rose from 71% to 94% by detailing the full SFT data‑engineering pipeline, teacher‑model trajectory generation, quality validation, evaluation methodology, and interview‑ready talking points.
Why General Models' Tool Calling Is Just Barely Sufficient
Current mainstream large models (GPT‑4o, DeepSeek V3, Qwen3) support function calling out‑of‑the‑box, but their accuracy on complex, multi‑tool tasks—such as a financial query that requires search, web access, and Python computation with strict ordering—typically sits between 65% and 75%.
At 65% accuracy, roughly one out of three tool calls is wrong (wrong tool, malformed parameters, or unnecessary call), which is tolerable in demos but disastrous in production because a single error can derail the entire reasoning chain and produce an apparently complete but factually incorrect report.
In the DeepResearch project, the base Qwen3‑30B‑A3B model achieved 71% tool‑calling accuracy without fine‑tuning. After applying function‑calling SFT (FC‑SFT), the same model reached 94% on the identical test set—a 23‑point gain driven by data engineering rather than a larger model.
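A rough back-of-the-envelope sketch shows why per-call accuracy matters so much once calls are chained. It assumes every call in a chain succeeds independently with the same probability, which is a simplification and not something the project measured; `chain_success_rate` is a hypothetical helper.

```python
def chain_success_rate(per_call_accuracy: float, num_calls: int) -> float:
    """Probability that every call in a chain is correct, assuming independence."""
    if not 0.0 <= per_call_accuracy <= 1.0:
        raise ValueError("per_call_accuracy must be between 0 and 1")
    if num_calls < 0:
        raise ValueError("num_calls must be non-negative")
    return per_call_accuracy ** num_calls


# Compare the base and fine-tuned per-call accuracies over chains of growing length.
for n in (1, 3, 5, 8):
    base = chain_success_rate(0.71, n)
    tuned = chain_success_rate(0.94, n)
    print(f"{n} calls: base {base:.1%}, fine-tuned {tuned:.1%}")
```

Under this assumption, a five-call chain completes without error about 18% of the time at 71% per-call accuracy and about 73% of the time at 94%.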
Data Engineering Step 1: Where Seed Data Comes From
The key to effective SFT is high‑quality training data, not sheer quantity. The team started with about 200 seed questions sourced from two channels:
Channel 1: Online system logs – Real user queries from early‑stage deployments, providing authentic distribution and difficulty (e.g., a corporate debt‑maturity analysis request that is hard to invent manually).
Channel 2: Expert manual authoring – Domain experts crafted scenarios that appear rarely in logs, such as multi‑hop reasoning, real‑time data retrieval, or cross‑validation research tasks, contributing roughly 40% of the seeds.
To reach a training size of ~1,200 questions, three expansion strategies were applied:
Question rewriting (1.5×): Use an LLM to generate paraphrases while preserving the underlying tool‑calling pattern.
Parameter variation (2×): Change concrete parameters like dates, company names, or numeric values to broaden coverage.
Combination expansion (1.5×): Merge single questions into more complex multi‑step tasks (e.g., “search company A’s revenue then compare with company B”).
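Applying all three strategies grew the dataset from 200 to roughly 1,200 training questions. Note that the nominal multipliers compound to only 1.5 × 2 × 1.5 = 4.5× (about 900 questions), so the per‑strategy factors should be read as approximate.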
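Parameter variation is the most mechanical of the three strategies, so it is the easiest to sketch. The following is a minimal illustration, assuming seed questions are stored as templates with named slots; the helper `vary_parameters`, the template text, and the slot values are all hypothetical and not taken from the project.

```python
from itertools import product
from string import Template


def vary_parameters(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Fill every combination of slot values into one question template."""
    names = list(slots)
    questions = []
    for values in product(*(slots[name] for name in names)):
        mapping = dict(zip(names, values))
        questions.append(Template(template).substitute(mapping))
    return questions


# Hypothetical seed question with two slots.
seed = "What was $company's revenue in $year, and how did it change from the prior year?"
variants = vary_parameters(
    seed,
    {"company": ["Company A", "Company B"], "year": ["2022", "2023"]},
)
for question in variants:
    print(question)
```

Because only the slot values change, each variant keeps the tool-calling pattern of its seed while broadening the concrete entities the model sees.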
Data Engineering Step 2: Teacher Model Generates Trajectories
With 1,200 questions ready, the next step is to generate full tool‑calling trajectories for each question: Thought → Action → Observation → … → Final Answer.
The team used DeepSeek V3 as the teacher model because its function‑calling capability is among the best of open‑source models and its cost is manageable.
The teacher model’s system prompt explicitly defines the trajectory format, for example:
<think>
[reasoning: analyze task, decide which tool to call and why]
</think>
<tool_call>
{
"name": "search",
"arguments": {"query": "specific search term", "max_results": 5}
}
</tool_call>
<observation>
[tool return]
</observation>
<think>
[further reasoning based on observation]
</think>
...
<final_answer>
[final synthesized answer]
</final_answer>
Design considerations:
<think> tags make the reasoning process explicit, enabling the fine‑tuned model to learn not only *what* tool to call but *why*.
Tool calls are expressed in strict JSON for easy validation.
Alternating <observation> and <think> forces the model to “look at the result then think”, rather than generating the entire chain in one shot.
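One practical benefit of a strict tagged format is that trajectories can be split into steps with very little code. The sketch below is one possible parser, assuming the tags are never nested and always closed; `parse_trajectory` and `STEP_PATTERN` are hypothetical names, not part of the project.

```python
import json
import re

# Matches one tagged step; the backreference \1 requires the closing tag to match.
STEP_PATTERN = re.compile(
    r"<(think|tool_call|observation|final_answer)>\s*(.*?)\s*</\1>",
    re.DOTALL,
)


def parse_trajectory(text: str) -> list[dict]:
    """Split a tagged trajectory into ordered steps; tool calls are decoded from JSON."""
    steps = []
    for tag, body in STEP_PATTERN.findall(text):
        step = {"type": tag, "content": body}
        if tag == "tool_call":
            # Raises json.JSONDecodeError if the call is not valid JSON.
            step["call"] = json.loads(body)
        steps.append(step)
    return steps


example = """<think>Need the latest revenue figure.</think>
<tool_call>
{"name": "search", "arguments": {"query": "Company A revenue", "max_results": 5}}
</tool_call>
<observation>[tool return]</observation>
<final_answer>[final synthesized answer]</final_answer>"""

steps = parse_trajectory(example)
print([step["type"] for step in steps])
```

Keeping the parsed steps in order makes it straightforward to check that each observation is followed by a reasoning step before the next call.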
Quality Validation: Not All Teacher‑Generated Trajectories Are Usable
Out of 1,200 generated trajectories, about 30% were filtered out through a three‑tier validation process.
Format validation (automatic): JSON parsability, tool name existence, required parameters present.
Logic validation (automatic): Reasonable step count (2–15 steps), final answer supported by tool calls, no redundant repeated calls.
Human spot‑check (10% sample): Verify that the reasoning is sensible and not a round‑about path that would teach inefficient habits.
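The two automatic tiers can be approximated with a small checker. This is a sketch under stated assumptions: the tool registry `TOOL_SCHEMAS` is invented for illustration, each tool call is counted as one step, and the check that the final answer is supported by the tool calls is left out because it needs more than rule-based logic.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
TOOL_SCHEMAS = {
    "search": {"query"},
    "python": {"code"},
}


def validate_calls(raw_calls: list[str], min_steps: int = 2, max_steps: int = 15) -> list[str]:
    """Return a list of problems; an empty list means the trajectory passes."""
    problems = []
    if not min_steps <= len(raw_calls) <= max_steps:
        problems.append(f"step count {len(raw_calls)} outside {min_steps}-{max_steps}")
    seen = set()
    for index, raw in enumerate(raw_calls):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f"call {index}: invalid JSON")
            continue
        if not isinstance(call, dict) or not isinstance(call.get("arguments"), dict):
            problems.append(f"call {index}: expected an object with an 'arguments' object")
            continue
        name = call.get("name")
        if not isinstance(name, str) or name not in TOOL_SCHEMAS:
            problems.append(f"call {index}: unknown tool {name!r}")
            continue
        missing = TOOL_SCHEMAS[name] - set(call["arguments"])
        if missing:
            problems.append(f"call {index}: missing {sorted(missing)}")
        # Identical tool and arguments seen before counts as a redundant call.
        key = (name, json.dumps(call["arguments"], sort_keys=True))
        if key in seen:
            problems.append(f"call {index}: repeats an earlier call")
        seen.add(key)
    return problems
```

A trajectory would be kept only when `validate_calls` returns an empty list, with the human spot-check applied to a sample of the survivors.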
After filtering, ~840 high‑quality trajectories remained. A common pitfall highlighted is **data distribution imbalance**—if 70% of the data involve search tools and only 10% involve Python execution, the fine‑tuned model will over‑prefer search even when Python is appropriate. The solution is to balance tool types during seed construction.
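A quick way to spot this kind of imbalance is to measure each tool's share of all calls before training. The sketch below assumes each trajectory has already been reduced to its list of tool names; the helper names and the 20% floor are illustrative choices, not values from the project.

```python
from collections import Counter


def tool_share(trajectories: list[list[str]]) -> dict[str, float]:
    """Fraction of all tool calls that go to each tool."""
    counts = Counter(tool for calls in trajectories for tool in calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


def underrepresented(shares: dict[str, float], floor: float = 0.2) -> list[str]:
    """Tools whose share of calls falls below a chosen floor."""
    return sorted(tool for tool, share in shares.items() if share < floor)


# Hypothetical dataset: 10 calls in total, 7 of them to search.
trajectories = [
    ["search", "search", "web"],
    ["search", "python"],
    ["search", "search", "search", "web", "search"],
]
shares = tool_share(trajectories)
print(shares)
print(underrepresented(shares))
```

Tools flagged here are candidates for extra seed questions, which matches the advice to fix balance at seed construction rather than after generation.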
Claude Code's ExtractionCoordinator: Another Automatic Knowledge Accumulation
The article draws a parallel with Claude Code’s ExtractionCoordinator, which accumulates useful memories from conversations. After each dialogue, an asynchronous task checks whether at least four new messages have been exchanged (MIN_NEW_MESSAGES = 4) before extracting and persisting valuable information to MEMORY.md or a topic‑specific memory file.
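# Simplified Claude Code logic
class ExtractionCoordinator:
    MIN_NEW_MESSAGES = 4  # minimum new messages before extraction is attempted

    def __init__(self, model, memory_writer):
        # Collaborators are injected; their interfaces are assumed for this sketch.
        self.model = model                  # exposes async extract_memories(messages)
        self.memory_writer = memory_writer  # exposes update(memories)
        self._is_running = False            # an extraction is currently in progress
        self._dirty = False                 # a request arrived during an extraction

    async def maybe_extract(self, session):
        new_messages = session.messages_since_last_extraction
        if len(new_messages) < self.MIN_NEW_MESSAGES:
            return  # below the threshold, nothing to do yet
        if self._is_running:
            self._dirty = True  # remember to run once more when the current pass ends
            return
        await self._run_extraction(session)

    async def _run_extraction(self, session):
        self._is_running = True
        try:
            memories = await self.model.extract_memories(session.recent_messages)
            if memories:
                self.memory_writer.update(memories)
        finally:
            self._is_running = False
        if self._dirty:
            # Catch up on requests that arrived while the extraction was running.
            self._dirty = False
            await self._run_extraction(session)

This design mirrors the SFT data‑quality gate: only information that passes a threshold is retained, improving overall system value.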
After Fine‑Tuning: How to Verify the 23‑Point Gain
Evaluation follows three core principles: no overlap between training and test sets, test scenarios must cover unseen distributions, and multiple metrics beyond raw accuracy are measured.
Layer 1 – Standard test set (100 items): Held‑out seed data to ensure basic capability is retained.
Layer 2 – Out‑of‑distribution test (50 items): Includes rare or complex multi‑tool tasks and error‑handling cases to assess generalization.
Layer 3 – Adversarial test (30 items): Deliberately crafted “should‑not‑call‑tool” questions (e.g., “2 + 2 = ?”) to detect over‑fitting to a “call‑tool‑for‑everything” bias.
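The first principle, no overlap between training and test sets, can be checked mechanically for exact or near-exact duplicates. This is a minimal sketch with hypothetical helper names; it only catches questions that match after normalization, so paraphrases produced by question rewriting would slip through, and holding out whole seeds before expansion is probably the safer approach.

```python
import re


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial edits still match."""
    cleaned = re.sub(r"[^\w\s]", "", question.lower())
    return " ".join(cleaned.split())


def find_overlap(train: list[str], test: list[str]) -> list[str]:
    """Return test questions whose normalized form also appears in the training set."""
    train_keys = {normalize(q) for q in train}
    return [q for q in test if normalize(q) in train_keys]


# Hypothetical questions for illustration.
train_questions = ["What is Company A's revenue in 2023?"]
test_questions = ["what is company a's revenue in 2023", "What is 2 + 2?"]
leaked = find_overlap(train_questions, test_questions)
print(leaked)
```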
Additional dimensions evaluated are tool‑selection accuracy, parameter‑format accuracy, call‑order accuracy, and **forgetting rate** (performance drop on pure QA tasks). To mitigate forgetting, 30% of the training data are generic instruction samples; this proportion was empirically determined.
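The article names these metrics without defining them, so the sketch below picks one plausible definition for each: tool selection compares the tools used regardless of order, call order compares the sequence of tool names, and parameter accuracy requires every call to match exactly. These definitions, the `Scores` class, and the `score` function are assumptions for illustration; forgetting rate is not covered because it needs a separate QA benchmark run.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    tool_selection: float
    call_order: float
    parameters: float


def score(gold: list[list[dict]], predicted: list[list[dict]]) -> Scores:
    """Score predicted call lists against gold call lists, one list per test question."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    selection = order = params = 0
    for g, p in zip(gold, predicted):
        g_names = [call["name"] for call in g]
        p_names = [call["name"] for call in p]
        selection += sorted(g_names) == sorted(p_names)  # same tools, any order
        order += g_names == p_names                      # same tools, same order
        params += g == p                                 # names and arguments all match
    n = len(gold)
    return Scores(selection / n, order / n, params / n)


search_a = {"name": "search", "arguments": {"query": "a"}}
search_b = {"name": "search", "arguments": {"query": "b"}}
run_code = {"name": "python", "arguments": {"code": "x"}}

gold = [[search_a, run_code], [], [search_a]]
predicted = [[run_code, search_a], [], [search_b]]
result = score(gold, predicted)
print(result)
```

An empty gold list represents a "should-not-call-tool" question, so the adversarial layer can be scored with the same function: any predicted call on such a question counts as a miss.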
Results: FC‑SFT with 5,000 samples raised tool‑calling accuracy from 71% to 94%, while MMLU (a general‑knowledge multiple‑choice benchmark) dropped by only 0.3%, a change small enough to be treated as negligible.
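With test sets this small, it is worth checking that a gain is larger than sampling noise. The sketch below computes Wilson score intervals, assuming the 71% and 94% figures were measured on the 100-item standard set; the article does not state the sample size behind those numbers, so treat that as an assumption.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


base_low, base_high = wilson_interval(71, 100)
tuned_low, tuned_high = wilson_interval(94, 100)
print(f"base:       {base_low:.3f} to {base_high:.3f}")
print(f"fine-tuned: {tuned_low:.3f} to {tuned_high:.3f}")
print("intervals overlap:", base_high >= tuned_low)
```

Under that assumption the two intervals do not overlap, which supports reading the 23-point gain as a real effect rather than noise.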
How to Answer “SFT Fine‑Tuning” in an Interview
The interview expects a clear articulation of the data‑engineering pipeline rather than a textbook definition of SFT.
Layer 1 – Problem statement (≈20 s): Explain that generic models achieve only 65‑75% accuracy on complex tool‑calling tasks and that FC‑SFT aims to reach production‑grade performance without changing the base model.
Layer 2 – Process (≈90 s): Describe the four steps – seed question construction (online logs + expert authoring, ~200 items), expansion to ~1,200 items, teacher‑model trajectory generation, and quality filtering plus distribution control (including 30% generic data to prevent forgetting).
Layer 3 – Evaluation (≈30 s): Outline the three‑layer test set, the four evaluation metrics, and the final numbers (71% → 94% tool‑calling accuracy, negligible MMLU degradation).
If pressed on the 30% generic data ratio, note that it is an empirically‑derived safeguard against catastrophic forgetting of non‑tool‑calling abilities.
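If the interviewer asks how the ratio translates into sample counts, the arithmetic is short. The sketch below assumes the 30% is measured against the final mixed dataset and uses the 840 filtered trajectories as the tool-calling portion; both are illustrative assumptions, since the article does not spell out how the mix was assembled.

```python
def generic_samples_needed(tool_samples: int, generic_ratio: float) -> int:
    """Generic samples to add so they make up generic_ratio of the final mix."""
    if tool_samples < 0 or not 0 <= generic_ratio < 1:
        raise ValueError("need tool_samples >= 0 and 0 <= generic_ratio < 1")
    # Solve g / (tool_samples + g) = generic_ratio for g.
    return round(generic_ratio * tool_samples / (1 - generic_ratio))


generic = generic_samples_needed(840, 0.30)
total = 840 + generic
print(f"add {generic} generic samples for a total of {total}")
```

With these inputs the mix comes out to 360 generic samples alongside 840 trajectories, for 1,200 samples in total.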
Summary
FC‑SFT’s essence is turning a model from “just enough” to “production ready” through rigorous data engineering, not magic hyper‑parameter tricks.
Data quality outweighs quantity: 840 high‑quality trajectories outperform thousands of noisy examples.
Distribution balance and forgetting protection are hidden control variables; neglecting them skews model behavior.
Closed‑loop evaluation—test sets and forgetting monitoring—is indispensable for trustworthy improvements.
Claude Code’s ExtractionCoordinator embodies the same philosophy: selective, threshold‑based knowledge accumulation beats indiscriminate data piling.
Both systems demonstrate that thoughtful selection is more valuable than sheer volume.
Wu Shixiong's Large Model Academy
