
Performance Platform: Accelerating Data Production and Consumption

This article describes how Baidu's Performance Platform speeds up data production and consumption across the company's R&D pipelines. It covers five optimization paths and 18 concrete methods, including service tiering, compliance measures, and self-service analytics for both real-time in-memory tables and offline disk tables.

Architecture Digest

The Performance Platform is a one-stop solution for app performance tracing that provides comprehensive, real-time analysis services and toolchains. It supports a wide range of products (Baidu APP, mini-programs, Matrix APP, and others), processes billions of data points daily, and covers over 600 million users.

Key Terminology

TuLing: next-generation data-warehouse platform with improved real-time computation, storage, query engine, and resource scheduling.

UDW: Universal Data Warehouse, Baidu's early data-warehouse offering unified, high-quality user-behavior data.

TM: workflow-oriented distributed computing system with high reliability and throughput, supporting near-real-time stream processing.

YiMai: self-service analytics tool that integrates multiple data sources and enables visual query building without SQL knowledge.

AFS: Baidu's Append-Only Distributed File System, used for offline computing, AI training, and backup.

Business Overview

The platform provides end‑to‑end performance monitoring, from crash and lag detection to log retrieval, network analysis, and disk usage, serving multiple product lines.

Challenges

Data Production Stage: legacy infrastructure, insufficient storage, slow query engines, no real-time compute, and weak resource elasticity limit data throughput. Service levels are not clearly defined, which leads to high latency (from minutes to days) and redundant data collection.

Data Consumption Stage: compliance difficulties, fragmented data export paths, slow report generation, and long manual development cycles reduce user satisfaction.

Optimization Paths

3.1 New vs. Old Infrastructure

Adopt a unified stream-batch approach: instead of importing data into UDW through the static QE engine, parse logs in real time with TM and load them into TuLing with Spark dynamic indexing. Nested fields are flattened early in the pipeline, which removes the need for intermediate tables.
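To make "flattening nested fields early" concrete, here is a minimal sketch in plain Go. It is not the TM or Spark code used by the platform; the function names, the "." key separator, and the sample log fields are all assumptions chosen for illustration. It shows how a nested JSON log line can be turned into a single-level row that can be written straight to a wide table.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatten walks a decoded JSON object and writes every leaf value into out,
// joining nested keys with "." so that {"a":{"b":1}} becomes {"a.b":1}.
// Arrays are kept as-is; only nested objects are expanded.
func flatten(prefix string, in map[string]any, out map[string]any) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if child, ok := v.(map[string]any); ok {
			flatten(key, child, out)
			continue
		}
		out[key] = v
	}
}

// flattenLog parses one raw JSON log line into a single-level row.
func flattenLog(raw []byte) (map[string]any, error) {
	var nested map[string]any
	if err := json.Unmarshal(raw, &nested); err != nil {
		return nil, err
	}
	row := make(map[string]any)
	flatten("", nested, row)
	return row, nil
}

func main() {
	// Hypothetical log line; the field names are illustrative only.
	raw := []byte(`{"app":"demo","perf":{"launch_ms":812,"net":{"type":"wifi"}}}`)
	row, err := flattenLog(raw)
	if err != nil {
		panic(err)
	}
	// Sort the column names so the output order is stable.
	keys := make([]string, 0, len(row))
	for k := range row {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %v\n", k, row[k])
	}
}
```

Because every leaf becomes its own column at parse time, downstream queries can read the wide table directly instead of unpacking nested structures in a separate intermediate step.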

3.2 Service Tiering

Improve service efficiency by better understanding user needs.

Enhance service quality and satisfaction.

Optimize resource allocation based on tiered demands.

3.3 Data Point / Metric / Report Governance

Define a clear SLA for each data flow (real-time: minutes; near-real-time: 30 minutes; offline: hourly) and enforce it through standardized pipelines.

3.4 Data Compliance Governance

Following China’s Data Security Law, enforce data classification, export controls, and unified data‑access interfaces to ensure compliance while maintaining performance.

3.5 Self‑Service Data Construction

3.5.1 Real‑Time (In‑Memory) Self‑Service

Parse UBC log schemas offline, then at runtime flatten logs into an in‑memory wide table using built‑in functions and custom QLExpress functions. Apply aggregation templates (PV, UV, quantiles) to produce metrics, which can be visualized (line charts, tables) and used for threshold‑based alerts.
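The aggregation templates and threshold alerts described above can be sketched as follows. This is an illustrative Go version only: the platform's own templates and its QLExpress custom functions are not shown in the source, so the `Row` columns, the function names, and the choice of the nearest-rank quantile method are all assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Row is one flattened log record in the in-memory wide table.
// The column names are hypothetical.
type Row struct {
	UserID   string
	LaunchMs float64
}

// pv counts events (one per row).
func pv(rows []Row) int { return len(rows) }

// uv counts distinct users.
func uv(rows []Row) int {
	seen := make(map[string]struct{})
	for _, r := range rows {
		seen[r.UserID] = struct{}{}
	}
	return len(seen)
}

// quantile returns the q-quantile (0 <= q <= 1) using the nearest-rank
// method on a sorted copy of values.
func quantile(values []float64, q float64) (float64, error) {
	if len(values) == 0 {
		return 0, fmt.Errorf("quantile of empty input")
	}
	if q < 0 || q > 1 {
		return 0, fmt.Errorf("q must be in [0, 1], got %v", q)
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1], nil
}

// breaches reports whether a metric value exceeds its alert threshold.
func breaches(value, threshold float64) bool { return value > threshold }

func main() {
	rows := []Row{
		{"u1", 100}, {"u2", 300}, {"u1", 200}, {"u3", 500}, {"u2", 400},
	}
	launch := make([]float64, 0, len(rows))
	for _, r := range rows {
		launch = append(launch, r.LaunchMs)
	}
	p50, err := quantile(launch, 0.5)
	if err != nil {
		panic(err)
	}
	fmt.Println("PV:", pv(rows), "UV:", uv(rows), "P50 launch ms:", p50)
	// Hypothetical alert rule: fire when the median launch time exceeds 250 ms.
	if breaches(p50, 250) {
		fmt.Println("ALERT: median launch time above threshold")
	}
}
```

Each template is a pure function over the rows in a time window, so the same metric value can feed both a chart and an alert rule.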

3.5.2 Near‑Real‑Time & Offline (Disk) Self‑Service

Layered data warehouses (e.g., Feed wide tables) reduce data volume at each tier, improving SLA for downstream reports. Users consume these wide tables via self‑service platforms to quickly generate dashboards.

Future Outlook

Continue to improve data timeliness (closing the gap between promised and actual report delivery times), data accuracy (checking actual data against expected cases), and compliant use of data in support of business growth.

Code Example: Metadata Query Schema

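// Schema describes the data-retrieval capabilities of a metadata request.
// Protocol capabilities:
// 1. Data query: wraps multi-engine / standard SQL querying
//    (e.g. palo, mysql, clickhouse, es-sql) in the Query struct.
// 2. Data aggregation: combines and merges data on a single key;
//    enabled when len(Schema.Query) > 1.
// 3. Data caching: two-level caching, wrapped in the Cache struct.
//    (The Cache type definition is not shown in this excerpt.)

type Schema struct {
  // Query lists the metadata queries to run.
  Query []Query `json:"query"`
  // Cache describes caching for the overall metadata result.
  Cache Cache `json:"cache"`
}

// Query describes a single data query.
type Query struct {
  // SQL is the structured query text.
  SQL string `json:"sql"`
  // Prod is the product line. rpc_name = meta_{engine}_{prod}.toml; RPC
  // communication provides timeout control, service discovery, advanced
  // load balancing, and other stability features.
  Prod string `json:"prod"`
  // Engine names the storage engine and selects which engine to call.
  Engine string `json:"engine"`
  // Cache describes caching for this single query.
  Cache Cache `json:"cache"`
}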