
How an R&D Data Platform Leverages Large Language Models to Accelerate Issue Diagnosis

This article explains how the R&D Data Middle Platform integrates large language models to automate data collection, real-time monitoring, intelligent analysis, and rapid root-cause identification for online issues. It details the architecture, wide-table modeling, generative BI, attribution algorithms, RAG enhancements, and future optimization plans.


Platform Overview

The R&D Data Middle Platform (also called the Performance Platform) is a one‑stop solution designed for APP performance tracking. It provides real‑time, end‑to‑end application performance monitoring, helping APP teams improve the efficiency of online problem investigation and resolution.

Coverage: Over 50 internal APPs, mini-programs, browsers, and externally acquired APPs.

Scale: Processes nearly a hundred billion data records daily, peaks at 300K QPS, and achieves sub-second end-to-end ingestion.

Business Support Capabilities

Visualization Reports: General-purpose dashboards such as problem overview, APP start-up speed, and user analysis.

Wide Tables & Datasets: Self-service analysis via the internal BI platform (TDA) for customized metrics like search reach rate and Feed page smoothness.

End-to-End Services: Integrated data reports and platform tools that support intelligent online problem analysis and cloud-controlled probing.

Motivation for Large‑Model Integration

Rapid business changes raised demands on the usability of the data platform. Traditional analysis required hours of manual work, and key metrics such as MTTI (Mean Time To Identify) remained at hour-level latency. The emergence of ChatGPT and Wenxin Yiyan (ERNIE Bot) prompted the exploration of large language models (LLMs) to reconstruct inefficient workflows and dramatically shorten diagnosis time.

Agent‑Based Intelligent Analysis Architecture

The system consists of two core components:

Controller: Performs intent recognition on incoming queries and routes them to appropriate task-planning modules.

Agent Collection: Each functional module is encapsulated as an independent Agent. The controller stitches Agents into a multi-Agent workflow, preserving task state for downstream Agents.

The overall answer accuracy depends on both the controller’s planning ability and each Agent’s execution precision.
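The sketch below illustrates this controller/agent split in miniature; all class and function names here are illustrative assumptions, not the platform's internal APIs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskState:
    """Shared state handed from one Agent to the next."""
    query: str
    results: dict = field(default_factory=dict)


class Agent:
    """Wraps one functional module behind a uniform interface."""

    def __init__(self, name: str, run: Callable[[TaskState], None]):
        self.name = name
        self._run = run

    def execute(self, state: TaskState) -> None:
        self._run(state)  # each Agent reads and enriches the shared state


class Controller:
    """Recognizes intent, then stitches Agents into a workflow."""

    def __init__(self, routes: dict[str, list[Agent]]):
        self.routes = routes

    def classify_intent(self, query: str) -> str:
        # In the real system this is an LLM call; a keyword check
        # stands in for it here.
        return "attribution" if "why" in query.lower() else "query"

    def handle(self, query: str) -> TaskState:
        state = TaskState(query=query)
        for agent in self.routes[self.classify_intent(query)]:
            agent.execute(state)  # task state is preserved downstream
        return state
```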

Data Infrastructure

Wide‑Table Modeling

Traditional layered warehouses (ODS → DWD → DWS → ADS) caused excessive table proliferation, complex joins, and poor query performance. To address these issues, the platform adopted a wide-table strategy:

Comprehensiveness: Covers all business scenarios with a limited set of tables.

Timeliness: Reduces latency caused by upstream data-source timing differences.

Usability: Enables analysts to obtain required insights from a few wide tables, avoiding multi-table joins (compare the two queries sketched below).

Accuracy: Unified logic and clear metric definitions eliminate ambiguity.

Intelligence: Supports natural-language query (Text2Viz) and automatic attribution.

[Figure: wide-table modeling process diagram]
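To make the usability point concrete, the two hypothetical queries below contrast the layered and wide-table approaches; every table and column name is invented for illustration.

```python
# Illustrative only. With a layered model, "crash rate by APP version"
# needs a fact-dimension join across warehouse layers:
layered_sql = """
SELECT d.app_version,
       SUM(f.crash_cnt) / SUM(f.launch_cnt) AS crash_rate
FROM dws_crash_agg f
JOIN dwd_device_dim d ON f.device_id = d.device_id
GROUP BY d.app_version
"""

# With a wide table, the same question is a single-table scan:
wide_sql = """
SELECT app_version,
       SUM(crash_cnt) / SUM(launch_cnt) AS crash_rate
FROM perf_events_wide
GROUP BY app_version
"""
```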

Generative BI and Automatic Attribution

The platform provides three core capabilities:

Data Analysis & Attribution: Users can formulate queries in natural language; the system translates them into optimized SQL or function calls, automatically identifies the most influential dimensions, and generates concise summaries (a minimal Text2SQL sketch follows this list).

Automatic Chart Generation: Based on query results, appropriate visualizations are produced without manual configuration.

Intelligent Summarization: The LLM produces a narrative explanation of the analysis, highlighting key drivers.
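A minimal Text2SQL sketch, assuming a hypothetical llm_complete callable and an invented wide-table schema; the platform's actual prompt and interface are not public.

```python
# Invented schema, reusing the wide table from the earlier example.
SCHEMA = """perf_events_wide(event_date DATE, app_version STRING,
os STRING, crash_cnt BIGINT, launch_cnt BIGINT)"""


def text2sql(question: str, llm_complete) -> str:
    """Translate a natural-language analytics question into SQL."""
    prompt = (
        "You translate analytics questions into SQL.\n"
        f"Schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only the SQL."
    )
    # Low temperature/top_p for deterministic, accurate SQL (see the
    # parameter-tuning note later in the article).
    return llm_complete(prompt, temperature=0.2, top_p=0.1)
```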

Attribution algorithms include (a contribution-score sketch follows the list):

Super‑Mean (micro) – quantifies the contribution of individual items compared with historical averages.

Variance (macro) – measures overall distribution characteristics.

Gini Coefficient (macro) – assesses inequality of metric distribution.

Contribution (micro) – isolates specific factors causing overall volatility.

JS Divergence (macro) – compares expected vs. actual business distributions.

Composite Contribution (micro) – decomposes changes into numerator and denominator effects.
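As a concrete instance of the micro-level Contribution idea, the sketch below apportions an overall metric change across dimension values; the data and the exact formula are illustrative assumptions, since the article does not spell out the implementations.

```python
def contribution(baseline: dict[str, float],
                 current: dict[str, float]) -> dict[str, float]:
    """Share of the overall metric change explained by each dimension value."""
    total_delta = sum(current.values()) - sum(baseline.values())
    if total_delta == 0:
        return {k: 0.0 for k in set(baseline) | set(current)}
    return {
        k: (current.get(k, 0.0) - baseline.get(k, 0.0)) / total_delta
        for k in set(baseline) | set(current)
    }


# Hypothetical example: crashes by APP version, yesterday vs. today.
print(contribution(
    baseline={"v13.2": 100, "v13.3": 120},
    current={"v13.2": 105, "v13.3": 260},
))  # v13.3 explains ~97% of the spike
```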

Retrieval‑Augmented Generation (RAG) and Knowledge Enhancement

To answer queries that require private or up-to-date knowledge, the platform adopts a RAG pipeline (a minimal end-to-end sketch follows these steps):

Chunk the knowledge base and embed each chunk with a transformer encoder.

Store embeddings in a vector database.

At query time, retrieve the most relevant chunks, construct a prompt that includes retrieved content, and let the LLM generate the answer.
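A minimal end-to-end sketch of these three steps; the encoder and vector store below are toy stand-ins for whatever the platform actually uses.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    # Stand-in for a transformer encoder: a deterministic pseudo-embedding
    # seeded from the text, normalized for cosine similarity.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)


class VectorStore:
    """Toy in-memory vector database."""

    def __init__(self):
        self.chunks, self.vecs = [], []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vecs.append(embed(chunk))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scores = [float(q @ v) for v in self.vecs]
        top = sorted(range(len(scores)), key=scores.__getitem__,
                     reverse=True)[:k]
        return [self.chunks[i] for i in top]


def answer(query: str, store: VectorStore, llm_complete) -> str:
    # Retrieve relevant chunks, pack them into the prompt, generate.
    context = "\n".join(store.search(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_complete(prompt)
```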

Identified drawbacks (limited similarity matching, retrieval latency, fragmented information, fixed chunk size) are mitigated through:

Metadata enrichment and business‑level indexing.

Dynamic chunk sizing and sentence‑window retrieval.

Hybrid retrieval (vector + BM25) with re-ranking via Reciprocal Rank Fusion (RRF); see the sketch after this list.

Prompt engineering with role, skills, task decomposition, schema linking, and few‑shot examples.

Domain‑specific fine‑tuning using the “Wenxin‑speed” model.
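Reciprocal Rank Fusion itself is simple enough to show in full. The sketch below fuses a vector ranking with a BM25 ranking; the document IDs are made up, and k = 60 is the commonly used constant, not a value stated in the article.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Fuse a dense (vector) ranking with a sparse (BM25) ranking:
fused = rrf([
    ["doc3", "doc1", "doc2"],  # vector-search order
    ["doc1", "doc4", "doc3"],  # BM25 order
])
print(fused)  # doc1 and doc3 rise to the top, agreed on by both retrievers
```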

Smart Diagnosis and Damage‑Control Recommendations

The end-to-end workflow for an online issue is as follows (a skeleton sketch follows these steps):

Collect real‑time alarm data (e.g., crash spikes, product line, system).

LLM‑driven automatic attribution identifies the most likely dimensions (APP version, OS, component).

Cross‑reference with release‑ticket data (via AFS files, task‑group dependencies, or field‑status services) to recommend the most relevant release tickets.

Generate a step‑by‑step damage‑control plan (e.g., host‑site shielding, query‑based investigation, replication scripts).

Perform one‑click intelligent diagnosis by combining problem features with the recommended ticket’s code, leveraging LLM reasoning.
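A skeleton of this workflow, with every internal service replaced by a trivial stub so the sketch runs end to end; none of these helpers correspond to real platform APIs.

```python
def attribute(alarm: dict) -> dict:
    # Step 2 stand-in: pick the dimension value with the largest spike.
    return {"app_version": alarm["spiking_version"]}


def match_release_tickets(dims: dict) -> list[str]:
    # Step 3 stand-in: look up tickets touching the suspect version.
    return [f"release-ticket-{dims['app_version']}"]


def diagnose(alarm: dict) -> dict:
    dims = attribute(alarm)                # step 2: automatic attribution
    tickets = match_release_tickets(dims)  # step 3: release cross-reference
    plan = ["shield host site",            # step 4: damage-control steps
            "run query-based investigation"]
    return {"dims": dims, "tickets": tickets, "plan": plan}


print(diagnose({"metric": "crash_rate", "spiking_version": "13.3"}))
```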

Model parameters are tuned per scenario (e.g., temperature 0.2 & top_p 0.1 for data analysis, temperature 0.7 & top_p 0.8 for creative writing) to balance accuracy and diversity.
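Expressed as configuration, the quoted settings might look like this; how the platform actually stores and applies them is an assumption.

```python
# Per-scenario sampling parameters, values taken from the article.
GENERATION_PARAMS = {
    "data_analysis":    {"temperature": 0.2, "top_p": 0.1},  # accuracy first
    "creative_writing": {"temperature": 0.7, "top_p": 0.8},  # diversity first
}
```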

Data Flywheel

User queries and feedback are continuously harvested, cleaned, and re‑labeled to become new training data. A case library, largely auto‑generated by the LLM and manually verified, stores resolved issue patterns. The flywheel loop – query → case generation → model improvement → better query response – creates a positive feedback cycle that reduces MTTR over time.

Future Directions

Further improve LLM accuracy and latency by incorporating specialized vertical models and expanding high‑quality labeled corpora.

Evolve the Agent architecture from semi‑autonomous to fully autonomous planning, eliminating dead‑ends and error propagation.

Extend intelligent summarization to weekly and monthly reports, delivering automated business insights and variance attribution.

Overall, the platform demonstrates how large language models can reconstruct low‑efficiency workflows, achieve near‑real‑time root‑cause analysis, and continuously enhance data‑driven operations.

Tags: large language models, data platform, Retrieval-Augmented Generation, intelligent analysis, generative BI, online issue diagnosis
Written by Baidu Geek Talk