
Large Model + OLAP: Enabling a New Data Service Platform

This article details how Tencent Music combines large language models with an Apache Doris-based OLAP engine, introducing a semantic layer, manual-experience routing, schema mapping, and plugin integration, and traces the evolution of its data architecture through four versions toward real-time, cost-effective, and scalable intelligent data services.

DataFunTalk

Current large language model (LLM) applications are driving a new wave of technology and business innovation worldwide. Tencent Music, leveraging its massive user base and diverse scenarios, explores LLM use cases and builds an efficient, real‑time unified analytical OLAP engine on Apache Doris to improve model conversion efficiency and result accuracy, ultimately delivering personalized, real‑time intelligent data services.

The traditional data service stack—SQL query platform, fixed dashboards, custom analysis tools, and manual data runs—poses high learning costs, low flexibility, and inefficient communication between analysts and engineers. Integrating LLMs enables natural‑language queries to be parsed into SQL, eliminating the need for manual SQL writing and rigid dashboards, and creating a new data service model that reduces learning barriers and accelerates analysis.

The solution is divided into four stages:

Semantic Layer: A middle layer that translates technical field names into business terms, allowing analysts to define metrics without directly interacting with the OLAP warehouse. It also simplifies complex multi‑table joins into single‑table queries, improving translation accuracy.
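A semantic layer of this kind can be sketched as a lookup table from business terms to physical columns. The table, column, and metric names below are illustrative assumptions, not Tencent Music's actual schema:

```python
# Minimal sketch of a semantic-layer lookup (hypothetical names throughout).
# A business term such as "daily active users" resolves to a physical column
# on a pre-joined flat table, so the caller never touches raw warehouse fields
# or writes multi-table joins.
SEMANTIC_MODEL = {
    "daily active users": {"table": "dws_user_activity_flat", "column": "dau", "agg": "SUM"},
    "play duration":      {"table": "dws_user_activity_flat", "column": "play_secs", "agg": "SUM"},
}

def metric_to_sql(metric: str, date: str) -> str:
    """Translate a business metric name into a single-table SQL query."""
    m = SEMANTIC_MODEL[metric.lower()]
    return (f"SELECT {m['agg']}({m['column']}) FROM {m['table']} "
            f"WHERE dt = '{date}'")
```

Because every metric resolves to one flat table, the generated SQL stays simple, which is exactly what makes LLM translation more accurate.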

Manual Experience Routing: Simple queries bypass the LLM and are sent directly to the OLAP engine, while complex queries trigger LLM parsing, reducing latency and token‑based costs.
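The routing decision can be reduced to a pattern check against a curated library of known-simple question shapes. The patterns below are invented for illustration; a real library would be distilled from analyst experience:

```python
import re

# Toy experience router: questions matching a curated template skip the LLM
# entirely and go straight to the OLAP engine (patterns are hypothetical).
DIRECT_PATTERNS = [
    re.compile(r"^top \d+ (songs|artists) yesterday$", re.I),
    re.compile(r"^dau on \d{4}-\d{2}-\d{2}$", re.I),
]

def route(question: str) -> str:
    """Return 'olap' for template-matched questions, 'llm' otherwise."""
    if any(p.match(question.strip()) for p in DIRECT_PATTERNS):
        return "olap"   # direct template-to-SQL path: no LLM latency or token cost
    return "llm"        # complex question: hand off to the LLM parser
```

Every question the router keeps off the LLM path saves both response time and per-token cost, which is the whole point of this stage.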

Schema Mapper & Knowledge Base: A schema‑mapping component matches user inputs with internal knowledge, enhancing private‑domain understanding and enabling accurate responses for domain‑specific queries.
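One simple form of such a mapper scans the question for private-domain terms and attaches their canonical schema entities as hints for the LLM. The knowledge-base entries here are made up for illustration:

```python
# Hypothetical private-domain knowledge base:
# colloquial business terms -> canonical schema entities.
KNOWLEDGE_BASE = {
    "vip users": "dim_user.membership_level",
    "hit songs": "dim_track.is_hit_flag",
}

def map_terms(question: str) -> dict:
    """Collect schema hints for every domain term the question mentions."""
    hints = {}
    q = question.lower()
    for term, field in KNOWLEDGE_BASE.items():
        if term in q:
            hints[term] = field
    return hints
```

The hints are prepended to the LLM prompt, so domain jargon the base model has never seen still resolves to the right columns.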

Plugin Integration: Third‑party plugins (embedding local documents or ChatGPT plugins) extend the LLM’s capabilities for customized scenarios such as legal, financial, or regulatory queries.

The resulting "Supersonic" platform combines LLMs and OLAP, where user questions flow through schema mapping, experience routing, semantic processing, and optional plugin calls before reaching the Doris engine for fast query execution.

Data architecture evolution:

Version 1.0: Separate processing, analysis, and application layers with wide tables, leading to data latency, high storage waste, and redundant component maintenance.

Version 2.0: Replaces ClickHouse with Apache Doris, introduces Multi‑Catalog for unified query access, and adds a semantic layer to eliminate duplicate metric definitions.

Version 3.0: Splits wide tables into slow‑changing dimension tables and metric tables, imports them into Doris via Hive, and leverages the semantic layer for unified metric access.

Version 4.0: Uses Doris Rollup and materialized views with GROUP BY and SUM to accelerate queries, applies CREATE VIEW with an end_date='9999-12-31' filter to expose the current (real-time) dimension records, and enriches the inverted index for faster lookups.
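The inverted index mentioned above maps each token to the rows containing it, so a keyword lookup touches only matching rows instead of scanning the whole table. A toy in-memory analogue with made-up data:

```python
# Toy inverted index (illustrative data): token -> set of row ids containing it.
docs = {1: "jay chou ballad", 2: "rock anthem", 3: "jay chou rock"}

index = {}
for row_id, text in docs.items():
    for token in text.split():
        index.setdefault(token, set()).add(row_id)

# A lookup for "rock" reads one index entry rather than scanning all rows:
print(sorted(index["rock"]))  # rows 2 and 3 contain "rock"
```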

Apache Doris performance optimizations include:

Colocate Join: Splits wide tables into metric and dimension tables, partitions them by start_date and end_date, and uses the Unique Key model for efficient aggregation.

Rollup: Creates multiple rollup views on base tables to avoid metric explosion and improve query speed with automatic GROUP BY processing.

Materialized Views: Derives new metrics via SUM and accelerates queries by pre‑computing results.
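What a GROUP BY + SUM materialized view pre-computes can be sketched with a toy in-memory analogue (the detail rows and keys below are invented):

```python
from collections import defaultdict

# Detail rows in a hypothetical base table: (song_id, dt, plays).
rows = [
    ("s1", "2023-01-01", 3),
    ("s1", "2023-01-01", 2),
    ("s2", "2023-01-01", 5),
]

def build_mv(detail):
    """Pre-aggregate plays per (song_id, dt), as the materialized view would store."""
    mv = defaultdict(int)
    for song, dt, plays in detail:
        mv[(song, dt)] += plays
    return dict(mv)

mv = build_mv(rows)
# A query for s1's plays on 2023-01-01 now reads one pre-computed row
# instead of re-scanning and summing the detail rows.
print(mv[("s1", "2023-01-01")])  # -> 5
```

In Doris the engine maintains this aggregate incrementally on ingest and rewrites matching queries to hit it automatically; the sketch only shows the shape of the stored result.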

Import performance enhancements:

Flink Doris Connector: Provides exactly‑once streaming writes with two‑phase commit, enabling billion‑scale real‑time ingestion.

Doris Compaction: Introduces Vertical Compaction (partial column reads) and Segment Compaction (parallel merge and ingest) to lower compaction scores and stabilize the pipeline.
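The exactly-once guarantee of the two-phase commit pattern can be sketched with a toy sink: writes are staged in an open transaction and become visible only when the checkpoint completes, so a failure before commit simply discards the staged batch for replay without producing duplicates. This is a conceptual analogue, not the connector's actual API:

```python
# Toy two-phase-commit sink mimicking the exactly-once write pattern.
class TwoPhaseSink:
    def __init__(self):
        self.staged = []     # phase 1: records buffered in an open transaction
        self.committed = []  # records visible to readers

    def write(self, record):
        self.staged.append(record)

    def checkpoint_complete(self):
        # Phase 2: the checkpoint succeeded, so commit the transaction.
        self.committed.extend(self.staged)
        self.staged.clear()

    def abort(self):
        # Failure before commit: drop staged records; replay causes no duplicates.
        self.staged.clear()
```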

Benefits achieved after adopting Apache Doris:

Query latency reduced from minutes to seconds or milliseconds thanks to rollup, materialized views, and inverted indexes.

Import time for 3000+ dimensions and metrics cut from over a day to about 8 hours (≈⅓ of the original).

Architecture simplified by unifying query and analysis engines, removing Elasticsearch clusters.

Storage cost lowered by ~30% and development effort reduced by ~40% through wide‑table splitting.

Future directions include extending Doris’s lake‑house capabilities for multi‑source federation, cross‑cluster data sync, read‑write separation, multi‑region backup, and exploring storage‑compute separation to further reduce costs while maintaining performance.

Tags: Performance Optimization, Data Warehouse, OLAP, Semantic Layer, Large Language Model, Apache Doris
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
