Boosting Private Agentic AI: LLM Post‑Training, DPO, and End‑to‑End Evaluation
This article shares practical experience on deploying private Agentic AI, covering background, architecture design, challenges, data generation, reinforcement learning with DPO, automated multi‑dimensional evaluation, and future plans for open‑source models and richer tool integration.
Background
Agentic AI differs from Chat or Workflow AI by requiring autonomous path planning, tool usage, and self‑reflection. It must independently generate solutions, call external functions (e.g., search engines, knowledge bases), and evaluate whether retrieved information is sufficient, deciding to continue searching, stop, or write results to files.
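The plan, act, and reflect cycle described above can be sketched as a small loop. This is a minimal illustration, not the article's implementation: the `search` tool, the `TOOLS` registry, and the `is_sufficient` check are hypothetical stand-ins, and a real agent would ask the model to plan each step and to judge sufficiency.

```python
from typing import Callable, Dict, List

# Hypothetical tool: a real deployment would call a search engine or knowledge base.
def search(query: str) -> str:
    return f"result for {query}"

# Hypothetical tool registry mapping tool names to callables.
TOOLS: Dict[str, Callable[[str], str]] = {"search": search}

def is_sufficient(notes: List[str], needed: int) -> bool:
    # Stand-in for self-reflection: here we only count collected notes.
    return len(notes) >= needed

def run_agent(task: str, needed: int = 2, max_steps: int = 5) -> Dict[str, object]:
    notes: List[str] = []
    steps = 0
    # Keep calling tools until the information is judged sufficient
    # or the step budget runs out.
    while steps < max_steps and not is_sufficient(notes, needed):
        query = f"{task} (step {steps + 1})"  # trivial "plan" for this sketch
        notes.append(TOOLS["search"](query))
        steps += 1
    return {"steps": steps, "notes": notes, "done": is_sufficient(notes, needed)}
```

The `max_steps` budget is one simple way to bound the number of tool calls when the sufficiency check never succeeds.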
Challenges of Private Deployment
Three major challenges arise when deploying Agentic AI privately: (1) high task complexity demanding many tool calls; (2) diverse tool requirements across scenarios, which large open‑source models struggle to handle; (3) substantial resource consumption due to long context windows and high GPU memory needs.
Overall Architecture Design
The system generates queries, creates diverse resources, runs an agent framework to obtain prompt‑response pairs, and applies end‑to‑end scoring to keep only high‑quality data. The filtered data feeds supervised fine‑tuning (SFT), followed by reinforcement learning with Direct Preference Optimization (DPO) using a process‑level comparison method.
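The score-based filtering step might look like the sketch below. The `Sample` fields, the 0 to 10 score range, and the threshold of 8.0 are assumptions chosen for illustration; the article does not state the actual scale or cutoff.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    prompt: str
    response: str
    score: float  # end-to-end score; a 0-10 scale is assumed here

def filter_for_sft(samples: List[Sample], threshold: float = 8.0) -> List[Sample]:
    # Keep only samples whose end-to-end score meets the (assumed) threshold.
    return [s for s in samples if s.score >= threshold]
```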
Data Generation for SFT
SFT data generation targets three properties:
Correct tool use: the right tool is selected, its parameters are filled in properly, and hand‑offs between agents happen where they should.
Accurate problem‑solving paths.
Diverse queries across domains, topics, and output formats.
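One way to check tool selection and parameter usage in a generated trace is a schema lookup, sketched below. The tool names, the required parameters, and the trace format (`{"tool": ..., "args": {...}}`) are hypothetical; they are not taken from the article.

```python
from typing import Any, Dict, List, Set

# Hypothetical schemas: tool name -> names of required parameters.
TOOL_SCHEMAS: Dict[str, Set[str]] = {
    "search": {"query"},
    "write_file": {"path", "content"},
}

def validate_trace(trace: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a sequence of tool calls."""
    errors: List[str] = []
    for i, call in enumerate(trace):
        name = call.get("tool")
        if name not in TOOL_SCHEMAS:
            errors.append(f"step {i}: unknown tool {name!r}")
            continue
        # Required parameters that the call did not supply.
        missing = TOOL_SCHEMAS[name] - set(call.get("args", {}))
        if missing:
            errors.append(f"step {i}: missing args {sorted(missing)}")
    return errors
```

A trace with an empty error list passes this structural check; whether the path actually solves the problem still needs a separate judgment.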
Automated Multi‑Dimensional Evaluation
Inspired by WritingBench, the evaluator generates a set of scoring dimensions and criteria for each response, scores the response on each dimension, and aggregates the results. Few‑shot prompting improves alignment with human scores, and the multi‑dimensional method yields the smallest gap between model and human evaluations.
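The aggregation and the model-versus-human comparison can be sketched as follows. The article does not say how scores are combined or how the gap is measured, so the weighted mean and the mean absolute difference used here are assumptions, as are the dimension names in the usage example.

```python
from typing import Dict, Sequence

def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    # Weighted mean over the dimensions generated for this response (assumed rule).
    total_weight = sum(weights[d] for d in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[d] * weights[d] for d in scores) / total_weight

def mean_abs_gap(model: Sequence[float], human: Sequence[float]) -> float:
    # Average absolute difference between model and human scores (assumed metric).
    if len(model) != len(human) or not model:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(m - h) for m, h in zip(model, human)) / len(model)
```

For example, `aggregate({"accuracy": 8.0, "format": 6.0}, {"accuracy": 3.0, "format": 1.0})` weights accuracy three times as heavily as format.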
DPO Data Synthesis
For each query, the model generates several responses. Each response undergoes rule‑based checks (JSON format, SOP completeness, hallucination detection). Qualified responses receive model‑generated scores; pairs of a passing and a failing response become training data for DPO. The DPO loss encourages higher rewards for good responses and lower rewards for bad ones.
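The pairing step and the loss can be sketched as below. Several details are assumptions: only the JSON-format rule is implemented (SOP completeness and hallucination checks are omitted), a response counts as passing when it clears the rule check and its score meets a threshold of 7.0, and the `prompt`/`chosen`/`rejected` keys are illustrative. The loss is the standard DPO objective for a single pair, computed from summed log-probabilities under the policy and a frozen reference model.

```python
import json
import math
from itertools import product
from typing import Dict, List, Tuple

def passes_rules(response: str) -> bool:
    # Only the JSON-format check is sketched here.
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def build_pairs(query: str, responses: List[Tuple[str, float]],
                threshold: float = 7.0) -> List[Dict[str, str]]:
    # responses: (text, model-generated score); the threshold is an assumption.
    def ok(text: str, score: float) -> bool:
        return passes_rules(text) and score >= threshold
    passing = [t for t, s in responses if ok(t, s)]
    failing = [t for t, s in responses if not ok(t, s)]
    # Every passing response is paired with every failing one.
    return [{"prompt": query, "chosen": c, "rejected": r}
            for c, r in product(passing, failing)]

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    # margin = beta * (chosen log-ratio - rejected log-ratio)
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the policy raises the chosen response's log-probability relative to the reference and lowers the rejected one's, which matches the "higher reward for good, lower for bad" behavior described above.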
Future Plans
Plans include open‑sourcing the 14B model, expanding tool and task diversity (e.g., programming and web generation), and introducing richer evaluation methods such as visual assessment for HTML output and automated test‑case generation for software.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
