Designing an AI-Powered Experiment Analysis Agent: Architecture, Workflow, and Future Enhancements
This article outlines the motivation, design, architecture, engineering implementation, and large‑model selection for an AI‑driven experiment analysis agent, along with plans for future improvements. The agent integrates data aggregation, modular workflow orchestration, and interactive frontend features to streamline AB‑test insights.
1. Background
Traditional algorithm experiments focus on core metrics such as UCTR and UCVR, but solely improving positive metrics is insufficient; a comprehensive analysis must also identify hidden risks and negative impacts on other key indicators to ensure decisions are balanced and safe.
Different experiments often produce divergent or contradictory results, indicating underlying causal relationships that need to be understood for better evaluation and iteration quality.
Existing experiment analysis tools are fragmented across platforms, lacking a unified workflow, which motivated the development of a dedicated analysis Agent.
2. Product Showcase
Report Example
* The report is anonymized for data and business security.
3. Product Design
3.1 Architecture Design
The analysis approach follows a top‑down, hierarchical method: start with macro‑level inspection, identify anomalies, then drill down to finer granularity for validation, mirroring a human analyst’s workflow.
Inspired by the AI Agent Manus, which decomposes user queries into structured task lists, performs online retrieval, and generates summarized outputs, the proposed framework adopts a "summarize‑data → sub‑analysis → summarize‑output" pipeline.
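The "summarize‑data → sub‑analysis → summarize‑output" pipeline can be sketched as three composable stages. This is a minimal illustration; the class, function names, and theme labels are assumptions, not the production API.

```python
# Minimal sketch of the summarize-data -> sub-analysis -> summarize-output
# pipeline: macro inspection first, then per-theme drill-downs, then a final
# aggregation. All names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class AnalysisContext:
    experiment_id: str
    summary: str = ""                             # macro-level data summary
    findings: list = field(default_factory=list)  # per-theme sub-analysis results


def summarize_data(ctx: AnalysisContext) -> AnalysisContext:
    # Macro inspection: aggregate core metrics (e.g. UCTR/UCVR) into one summary.
    ctx.summary = f"macro summary for {ctx.experiment_id}"
    return ctx


def run_sub_analyses(ctx: AnalysisContext, themes) -> AnalysisContext:
    # Drill down: each theme validates an anomaly spotted at the macro level.
    for theme in themes:
        ctx.findings.append(f"{theme}: checked against {ctx.summary}")
    return ctx


def summarize_output(ctx: AnalysisContext) -> str:
    # Aggregate sub-conclusions into the final report text.
    return "\n".join([ctx.summary, *ctx.findings])


report = summarize_output(
    run_sub_analyses(summarize_data(AnalysisContext("exp-001")),
                     themes=["significance", "negative-metric scan"]))
```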
To keep costs low, existing departmental reporting tools are reused for core data engineering, while new capabilities such as FDR analysis are developed independently and later stitched together.
The overall architecture is modular and layered:
Application Layer: Core business hub integrating analysis planning, theme analysis (implemented via a DAG workflow execution framework), and conclusion aggregation, powered by DeepSeek R1/V3 models.
Service Layer: Authentication, frontend form services, etc., built with JDSDK + CHO‑JSF.
Data Layer: Distributed engines (Doris‑X, ClickHouse, Spark) forming OLAP and BDP gateways for real‑time and offline data, respectively.
3.2 Product Design Details
The Agent aims to be a true "assistant" by integrating with the JD ME robot, allowing experiment analysts to access analysis capabilities without leaving their workflow.
A unified frontend form collects experiment ID, period, module, background, and expectations, then transparently forwards key information to various backend services, eliminating inconsistencies across tools.
By leveraging JD ME’s open platform, the Agent provides a seamless, in‑place analysis experience, improving usability and reducing context switching.
3.3 Workflow Design
The initial planner used few‑shot examples to let the LLM suggest analysis method lists, but suffered from inaccurate parsing, inflexible calls, and inability to pass intermediate conclusions.
It was upgraded to a DAG‑based workflow execution framework, enabling parallel execution of independent analyses and serial execution for dependent steps, markedly improving analysis quality.
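The core of such a DAG executor is small: nodes whose dependencies are all satisfied run in parallel, while dependent nodes wait for their upstream conclusions. The sketch below illustrates the idea under that assumption; the node names and task signature are hypothetical, not the framework's actual interface.

```python
# A minimal DAG execution sketch: each round, every node whose upstream results
# are ready runs in parallel; dependent nodes run in later rounds.
from concurrent.futures import ThreadPoolExecutor


def run_dag(tasks, deps):
    """tasks: {name: fn(upstream_results) -> result}; deps: {name: [upstream, ...]}."""
    results, remaining = {}, dict(deps)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # Nodes whose dependencies have all produced results can run now.
            ready = [n for n, ups in remaining.items()
                     if all(u in results for u in ups)]
            if not ready:
                raise ValueError("cycle detected in analysis plan")
            futures = {n: pool.submit(tasks[n], {u: results[u] for u in remaining[n]})
                       for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
                del remaining[n]
    return results


# Independent analyses (traffic check, metric scan) run in parallel;
# the summary step waits for both conclusions.
out = run_dag(
    tasks={
        "traffic": lambda ups: "traffic ok",
        "metrics": lambda ups: "UCTR +1.2%",
        "summary": lambda ups: " | ".join(sorted(ups.values())),
    },
    deps={"traffic": [], "metrics": [], "summary": ["traffic", "metrics"]},
)
```

Passing each node the dict of its upstream results is what lets intermediate conclusions flow between steps, which the original few‑shot planner could not do.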
Future work includes accumulating high‑quality few‑shot samples, then fine‑tuning a dedicated model with reinforcement learning to produce reliable chain‑of‑thought planning.
4. Engineering Technology
Frontend Unified Form
Vue 3 Framework: Frontend built with Vue 3.
Component Design: Reusable UI components.
Auto‑completion: Integrated with the trial‑gold interface for experiment name suggestions.
Historical Memory: Stored user form history in JIMDB keyed by ERP + tool name.
Authentication: Implemented via JSSDK for JD ME client authentication and ERP retrieval.
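The historical-memory item above amounts to a small key-value scheme: entries keyed by ERP account plus tool name, so reopening a tool pre-fills the user's last submission. A plain dict stands in for JIMDB in this sketch; the key layout and field names are assumptions.

```python
# Sketch of form-history memory keyed by ERP + tool name. A dict stands in
# for JIMDB (an in-house key-value store); key format is an assumption.
import json

_store = {}  # stand-in for the JIMDB key-value service


def history_key(erp: str, tool: str) -> str:
    return f"form_history:{erp}:{tool}"


def save_history(erp: str, tool: str, form: dict) -> None:
    _store[history_key(erp, tool)] = json.dumps(form)


def load_history(erp: str, tool: str):
    raw = _store.get(history_key(erp, tool))
    return json.loads(raw) if raw else None


# Hypothetical ERP account and tool name for illustration only.
save_history("zhangsan1", "ab-analysis",
             {"experiment_id": "exp-001", "period": "7d"})
```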
Multi‑Level Authorization
Combined tool permissions with platform permissions to ensure proper access control.
Message Interaction via JD ME
Dynamic message updates using JD ME messaging service and JIMDB to modify card content by Job_id, reducing message volume.
Routing framework that supports both default robot callbacks and custom card update services without conflict.
5. Large Model
5.1 Model Selection
Due to security constraints, only JD’s proprietary Yanxi model and locally deployed DeepSeek R1/V3 models were considered.
R1 showed occasional brilliance but was highly unstable, with error amplification leading to severe hallucinations [1]; therefore a hybrid R1 + V3 multi‑model approach was adopted.
5.2 Generation Quality
Prompt Engineering: Prompt design is the most critical factor for LLM output quality.
Dynamic Few‑Shot: Templates are customized based on experiment attributes and conditions.
Data‑to‑Text Conversion: Providing structured data plus metric explanations to the model is less effective than converting metrics into descriptive text via code, which mitigates numeric hallucinations.
Mechanism Design: Implemented timeout and output‑quality checks with retry logic to reduce no‑result occurrences for long token inputs.
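The data-to-text conversion can be done entirely in code so the model never performs arithmetic itself. A minimal sketch, assuming illustrative thresholds and wording (not the production templates):

```python
# Convert a structured metric comparison into a descriptive sentence in code,
# so the prompt carries prose rather than raw numbers the model must compute
# with. Thresholds and phrasing are assumptions for illustration.
def describe_metric(name, control, treatment, p_value, alpha=0.05):
    delta = (treatment - control) / control * 100
    direction = "rose" if delta > 0 else "fell"
    significance = ("statistically significant" if p_value < alpha
                    else "not significant")
    return (f"{name} {direction} by {abs(delta):.2f}% "
            f"({control:.4f} -> {treatment:.4f}), {significance} (p={p_value:.3f}).")


line = describe_metric("UCTR", control=0.0520, treatment=0.0531, p_value=0.012)
```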
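The timeout-and-retry mechanism can likewise be sketched as a small guard around each model call: a wall-clock budget, a validity check on the output, and a bounded retry. The LLM call and validity check below are placeholder assumptions.

```python
# Sketch of the timeout / output-quality-check / retry mechanism around an
# LLM call. The call itself and the validity predicate are stand-ins.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_guard(llm_call, prompt, is_valid, timeout_s=60, max_retries=2):
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _attempt in range(max_retries + 1):
            try:
                out = pool.submit(llm_call, prompt).result(timeout=timeout_s)
            except FutureTimeout:
                continue                  # model took too long; retry
            if is_valid(out):
                return out                # passed the output-quality check
        return None                       # no usable result after retries


answer = call_with_guard(
    llm_call=lambda p: f"CONCLUSION: {p}",   # placeholder for the model call
    prompt="UCTR uplift check",
    is_valid=lambda out: out.startswith("CONCLUSION:"),
)
```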
6. Future Improvements
Knowledge‑Distilled AB‑Test Expert Model : Build a domain‑specific corpus and use model distillation to replace the current R1 + few‑shot solution, enabling continuous learning of analysis patterns and generation of complete chain‑of‑thought (CoT) experiment analysis plans.
More Flexible Data Engineering Framework : Shift from a one‑time data preparation to on‑demand data fetching based on prior analysis conclusions, e.g., providing sample size and MDE analysis as a micro‑service integrated into the workflow.
Product Interaction Enhancements : Move from static one‑off reports to interactive, explainable analysis dialogs that expose reasoning steps, visualized inference chains, and support follow‑up queries, thereby increasing transparency and user trust.
7. References
1. https://www.vectara.com/blog/deepseek-r1-hallucinates-more-than-deepseek-v3
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
