How to Build Robust Function Call Training Data for LLM Agents

This article explains why function call capabilities in large language model agents require dedicated training, outlines the four core abilities to teach, describes the structure and sources of effective training data, and compares lightweight LoRA fine‑tuning with full supervised fine‑tuning approaches.


Why Function Calling Needs Training

General LLMs do not inherently understand a specific business toolset; without tailored data they may select wrong tools, miss arguments, produce malformed JSON, or ignore tool calls altogether. Training data is therefore essential to teach the model the correct tool‑selection logic and usage patterns.

Goals of the Training Data

A practical function‑call dataset should teach the model four abilities:

Tool Selection: decide when and which tool to invoke.

Argument Filling: correctly extract parameters such as dates, locations, numbers, and implicit user preferences.

JSON/Schema Formatting: generate well‑formed JSON that matches the tool schema (an example schema follows this list).

Multi‑turn Context: use conversation history to fill missing arguments or clarify ambiguous requests.
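For concreteness, the samples in the next section assume a schema along the lines of the sketch below. This is a hypothetical example following common function‑calling conventions (JSON‑Schema‑style parameters), not a specific vendor's API.

# Hypothetical schema for the search_flights tool used in the samples below.
search_flights_schema = {
    "name": "search_flights",
    "description": "Search one-way flights between two cities.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "Departure city"},
            "destination": {"type": "string", "description": "Arrival city"},
            "date": {"type": "string", "description": "Departure date, YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}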

What the Training Samples Look Like

The dataset typically contains three kinds of samples: positive examples, negative examples, and hard cases.

1️⃣ Positive Example

User: Help me find a flight from Shanghai to Beijing for tomorrow.
Assistant:
{
  "tool": "search_flights",
  "args": {
    "origin": "上海",
    "destination": "北京",
    "date": "2023-10-30"
  }
}

This shows the correct tool and argument extraction for a common query (the dates in these samples assume the current date is 2023-10-29).

2️⃣ Negative Example

When the user asks a non‑tool question, the model should respond without a tool call.

User: What will the weather be like in Beijing over the next three days?
Assistant: (no tool call)
The weather in Beijing will be fairly cold over the next three days…

Or when the request is ambiguous, the model should ask for clarification instead of calling a tool.
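For instance, an illustrative clarification exchange (the wording here is an assumption, not a sample from the original dataset) might look like:

User: Help me book a flight for next week.
Assistant: (no tool call)
Sure. Which cities are you flying between, and on which date next week?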

3️⃣ Hard Cases

Hard cases cover chain‑of‑thought reasoning, multi‑turn memory, ambiguous expressions, and frequent JSON errors.

User: Help me find flights for the day after tomorrow from 魔都 ("Magic City", slang for Shanghai) to 帝都 ("Imperial Capital", slang for Beijing).
Assistant (tool_call):
{
  "tool": "search_flights",
  "args": {
    "origin": "上海",
    "destination": "北京",
    "date": "2023-10-31"
  }
}

These examples force the model to resolve slang (魔都 → Shanghai, 帝都 → Beijing), fuzzy relative dates, and strict schema generation.
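A frequent‑JSON‑error hard case can be built the same way: pair a typical malformed output with the corrected target. The sample below is illustrative, reusing the same hypothetical search_flights tool.

Malformed output (unquoted date value, not valid JSON):
{"tool": "search_flights", "args": {"origin": "Shanghai", "destination": "Beijing", "date": 10-31}}

Corrected training target:
{
  "tool": "search_flights",
  "args": {
    "origin": "Shanghai",
    "destination": "Beijing",
    "date": "2023-10-31"
  }
}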

Data Sources (Four Types)

Synthetic Data: generate large volumes using templates or LLM‑assisted generation; this typically dominates the mix in low‑resource scenarios (≈70% of the data). A generation sketch follows this list.

Real Dialogue Logs: harvested from deployed agents; these provide authentic user expressions, implicit intents, and plenty of ambiguity.

Rule‑Based Hard Cases: crafted to stress boundary conditions such as cross‑month dates, misspellings, missing arguments, and schema edge cases.

Error‑Driven Bad Cases: each time a tool call fails in production, the failure is turned into a corrective training example, creating a closed‑loop improvement process.
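The sketch below illustrates the template approach from the Synthetic Data item above. The city list, phrasing templates, and record format are illustrative assumptions, not the article's actual dataset.

# Minimal template-based generator for positive search_flights samples.
import json
import random
from datetime import date, timedelta

CITIES = ["Shanghai", "Beijing", "Guangzhou", "Shenzhen"]
TEMPLATES = [
    "Help me find a flight from {origin} to {destination} {date_text}.",
    "Any tickets from {origin} to {destination} {date_text}?",
]
DATE_TEXTS = {"tomorrow": 1, "the day after tomorrow": 2}

def make_sample(today: date) -> dict:
    origin, destination = random.sample(CITIES, 2)  # two distinct cities
    date_text, offset = random.choice(list(DATE_TEXTS.items()))
    user = random.choice(TEMPLATES).format(
        origin=origin, destination=destination, date_text=date_text)
    call = {
        "tool": "search_flights",
        "args": {
            "origin": origin,
            "destination": destination,
            "date": (today + timedelta(days=offset)).isoformat(),
        },
    }
    # Store the target as a JSON string so the model learns exact formatting.
    return {"user": user, "assistant": json.dumps(call, ensure_ascii=False)}

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(make_sample(date(2023, 10, 29)))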

Dataset Scale and Ratio

A typical composition is 60% positive, 30% hard cases, and 10% negative examples. Size depends on tool count and business complexity:

Simple use‑cases: a few thousand examples.

Medium complexity (e.g., travel, e‑commerce): 10k–50k examples.

Enterprise‑grade multi‑turn agents: >100k examples.
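Applied to an assumed 20k‑example dataset (a mid‑band figure chosen for illustration, not one from the article), the 60/30/10 composition works out as follows:

# Per-type counts for a 60/30/10 dataset composition.
total = 20_000                 # assumed dataset size
positive = int(total * 0.60)   # 12,000 positive examples
hard = int(total * 0.30)       # 6,000 hard cases
negative = int(total * 0.10)   # 2,000 negative examples
print(positive, hard, negative)  # -> 12000 6000 2000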

Training Approaches

① Light‑weight Fine‑Tuning (LoRA / Adapters)

Freeze the base model and train only small low‑rank adapter weights on top of it. Fast, low‑cost, and stable for projects with few tools or frequent iteration cycles.
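A minimal LoRA setup sketch with Hugging Face transformers and peft; the base model name and hyperparameters are placeholder assumptions, not recommendations from the article.

# Attach LoRA adapters to a frozen causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; substitute your own base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights train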

② Full Supervised Fine‑Tuning + Preference Optimization

Use full SFT to teach correct behavior, then apply DPO/ORPO to penalize bad tool calls. Suitable for agents with many tools, complex arguments, or strict hallucination avoidance requirements.
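In the preference stage, each record pairs a correct tool call with a flawed one. The sketch below uses the prompt/chosen/rejected field convention common to DPO‑style trainers; the contents are illustrative.

# One preference pair: a well-formed call vs. a call with a wrong field.
pair = {
    "prompt": "Help me find a flight from Shanghai to Beijing for tomorrow.",
    "chosen": (
        '{"tool": "search_flights", "args": {"origin": "Shanghai", '
        '"destination": "Beijing", "date": "2023-10-30"}}'
    ),
    "rejected": (
        '{"tool": "search_flights", "args": {"origin": "Shanghai", '
        '"destination": "Beijing", "day": "tomorrow"}}'  # wrong field, fuzzy date
    ),
}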

Both pipelines typically start with the curated dataset described above, iteratively augmenting it with newly discovered bad cases.

Key Takeaways for Interviews

When asked about function‑call training, mention the need for a systematic dataset covering positive, negative, and hard cases; cite the four abilities the model must learn; discuss data sources (synthetic, real logs, rule‑based, error‑driven); and explain the choice between LoRA adapters and full SFT + DPO based on tool count and risk tolerance.
