Boosting Private Agentic AI: LLM Post‑Training, DPO, and End‑to‑End Evaluation

This article shares practical experience on deploying private Agentic AI, covering background, architecture design, challenges, data generation, reinforcement learning with DPO, automated multi‑dimensional evaluation, and future plans for open‑source models and richer tool integration.

DataFunSummit

Background

Agentic AI differs from Chat or Workflow AI by requiring autonomous path planning, tool usage, and self‑reflection. It must independently generate solutions, call external functions (e.g., search engines, knowledge bases), and evaluate whether retrieved information is sufficient, deciding to continue searching, stop, or write results to files.
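The plan, act, reflect loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `search` tool, the `is_sufficient` check, and the step budget are hypothetical stand-ins, not the system presented in the talk.

```python
from typing import Callable, Dict, List


def search(query: str) -> str:
    # Hypothetical tool: a real agent would call a search engine or knowledge base.
    return f"result for {query}"


# Hypothetical tool registry mapping tool names to callables.
TOOLS: Dict[str, Callable[[str], str]] = {"search": search}


def is_sufficient(notes: List[str], needed: int) -> bool:
    # Stand-in for self-reflection; a real agent would ask the model to judge.
    return len(notes) >= needed


def run_agent(task: str, needed: int = 2, max_steps: int = 5) -> Dict[str, object]:
    notes: List[str] = []
    for step in range(max_steps):
        # Reflect: stop as soon as the gathered information is judged sufficient.
        if is_sufficient(notes, needed):
            return {"status": "done", "steps": step, "notes": notes}
        # Plan and act: here the "plan" is simply a new query for each step.
        query = f"{task} (attempt {step + 1})"
        notes.append(TOOLS["search"](query))
    # Step budget exhausted: report whether the last call closed the gap.
    status = "done" if is_sufficient(notes, needed) else "stopped"
    return {"status": status, "steps": max_steps, "notes": notes}
```

The step budget matters in practice: without it, a model that never judges its information sufficient would keep calling tools indefinitely.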

Challenges of Private Deployment

Three major challenges arise when deploying Agentic AI privately: (1) high task complexity demanding many tool calls; (2) diverse tool requirements across scenarios, which large open‑source models struggle to handle; (3) substantial resource consumption due to long context windows and high GPU memory needs.

Overall Architecture Design

The system generates queries, creates diverse resources, runs an agent framework to obtain prompt‑response pairs, and applies end‑to‑end scoring to keep only high‑quality data. The filtered data is used first for supervised fine‑tuning (SFT) and then for reinforcement learning with Direct Preference Optimization (DPO), using a process‑level comparison method.

Data Generation for SFT

The generated SFT data must meet three requirements:

Correct tool selection, correct parameter usage, and proper hand‑off between agents.

Accurate problem‑solving paths.

Diverse queries across domains, topics, and output formats.
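The first requirement, correct tool selection and parameter usage, lends itself to an automated check. The sketch below assumes tool calls are stored as JSON objects with `name` and `arguments` fields; that layout, the `TOOL_SCHEMAS` table, and the tool names are hypothetical and not taken from the talk.

```python
import json
from typing import Dict, List

# Hypothetical tool schemas: tool name -> required parameter names.
TOOL_SCHEMAS: Dict[str, List[str]] = {
    "search": ["query"],
    "write_file": ["path", "content"],
}


def check_tool_call(raw: str) -> List[str]:
    """Return the problems found in one JSON-encoded tool call (empty if none)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["not a JSON object"]
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not an object"]
    # Report every required parameter that the call leaves out.
    return [f"missing parameter: {p}" for p in TOOL_SCHEMAS[name] if p not in args]
```

A sample would be kept for SFT only if every tool call in its trajectory returns an empty list. Path accuracy and query diversity need separate checks that this sketch does not cover.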

Automated Multi‑Dimensional Evaluation

Inspired by WritingBench, the evaluation generates multiple dimensions and scoring standards for each response, then aggregates scores. Few‑shot prompting improves alignment with human scores, while multi‑dimensional methods achieve the smallest gap between model and human evaluations.
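As a rough illustration of the aggregation step, the sketch below combines per-dimension scores into one number and measures the gap to human scores. The weighted average and the mean absolute difference are assumptions chosen for simplicity; the source does not say how scores are aggregated or how the gap is measured, and the dimension names are hypothetical.

```python
from typing import Dict, List


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (weights are an assumption)."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def mean_abs_gap(model: List[float], human: List[float]) -> float:
    """Mean absolute difference between model and human scores."""
    if len(model) != len(human) or not model:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(m - h) for m, h in zip(model, human)) / len(model)


# Hypothetical dimensions generated for one response.
example = aggregate({"accuracy": 8.0, "format": 6.0}, {"accuracy": 3.0, "format": 1.0})
```

Comparing `mean_abs_gap` across scoring setups (single score, few-shot, multi-dimensional) is one way to check which setup tracks human judgment most closely.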

DPO Data Synthesis

For each query, the model generates several responses. Each response undergoes rule‑based checks (JSON format, SOP completeness, hallucination detection). Responses that pass the checks receive model‑generated scores, and pairs of a passing and a failing response become training data for DPO. The DPO loss trains the policy to assign a higher implicit reward (its log‑probability relative to a reference model) to the preferred response than to the rejected one.
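The pairing step and the loss can be sketched as follows. The record fields (`text`, `passes_rules`, `score`), the score threshold, and the choice to pair every good response with every bad one are assumptions made for illustration. The loss is the standard DPO objective for a single pair, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from itertools import product
from typing import Dict, List, Tuple


def is_good(resp: Dict, threshold: float) -> bool:
    # Assumption: "passing" means the rule checks pass AND the score clears a threshold.
    # Responses that fail the rules carry no score, hence the default.
    return bool(resp["passes_rules"]) and resp.get("score", 0.0) >= threshold


def build_pairs(responses: List[Dict], threshold: float) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) text pairs for one query."""
    good = [r["text"] for r in responses if is_good(r, threshold)]
    bad = [r["text"] for r in responses if not is_good(r, threshold)]
    return list(product(good, bad))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * margin)) for one pair of sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2. It falls as the policy raises the chosen response's log-probability relative to the rejected one.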

Future Plans

Plans include open‑sourcing the 14B model, expanding tool and task diversity (e.g., programming and web generation), and introducing richer evaluation methods such as visual assessment for HTML output and automated test‑case generation for software.
