
Tencent iData Analysis Center: Why We Chose Spark as Our Computing Platform

Tencent's iData analysis center selected Spark as its new computing platform after evaluating ElasticSearch, TiDB, and other MPP-style solutions. Spark offers iterative processing, shuffle support, robust SQL, DAG scheduling, and flexible SMP-style data exchange between nodes, which together enable efficient OLAP on billions of game-user records.

Tencent Cloud Developer

This article describes how Tencent's iData analysis center selected a computing platform for its new data analysis system. The author explains why the team chose Spark over other solutions such as ElasticSearch and TiDB.

Background: Tencent's iData platform serves 567+ game businesses with data covering over 1.5 billion game users. After years of operation, they recognized limitations in their old analysis system and decided to build a new computing platform.

Why Not the Old System: The previous analysis system resembled an incomplete MapReduce structure with two major limitations: 1) It supported only a single MapReduce pass, so complex queries that need several rounds of computation could not be expressed; 2) It lacked a shuffle step, which is essential for regrouping and aggregating intermediate results in certain scenarios.
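The two missing capabilities can be illustrated with a minimal, single-process sketch. This is not the old iData system or Spark itself; the record layout, the function names, and the two example queries are assumptions made up for illustration. The shuffle routes every pair with the same key to the same partition, and the second query feeds the output of the first round into another round, which is the kind of iteration a single-pass system cannot do.

```python
from collections import defaultdict

# Hypothetical login records; the field names are illustrative only.
RECORDS = [
    {"user": "u1", "game": "A"},
    {"user": "u2", "game": "A"},
    {"user": "u1", "game": "B"},
    {"user": "u3", "game": "C"},
]

def shuffle(pairs, num_partitions):
    """Route every (key, value) pair with the same key to the same partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions

def reduce_phase(partitions):
    """Sum the values collected for each key in each partition."""
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = sum(values)
    return result

def run_round(records, mapper, num_partitions=2):
    """One full map -> shuffle -> reduce round."""
    pairs = [mapper(record) for record in records]
    return reduce_phase(shuffle(pairs, num_partitions))

def logins_per_game(records):
    """Round 1: count login records per game."""
    return run_round(records, lambda r: (r["game"], 1))

def games_per_login_count(per_game):
    """Round 2: consume round 1's output to count games per login total."""
    return run_round(per_game.items(), lambda kv: (kv[1], 1))

if __name__ == "__main__":
    first = logins_per_game(RECORDS)
    print(first)
    print(games_per_login_count(first))
```

Without the shuffle, each partition could only total the keys it happened to hold locally; without the second round, the "games per login total" question would need a separate job and manual plumbing between the two.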

Evaluation of Other Solutions:

1. ElasticSearch: Initially considered for its horizontal scaling and near real-time text retrieval. However, ES query nodes would fail when extracting large datasets (millions of records), making it unsuitable for their needs.

2. TiDB (HTAP Database): Offers 100% TP and 80% AP capabilities with MySQL compatibility. However, testing revealed that it struggles with large result sets (millions to hundreds of millions of records), because computation is only partially pushed down to the KV storage layer and the results are then aggregated on a single server, a typical limitation of MPP architectures.

3. MPP vs SMP: MPP (Massively Parallel Processing) is a shared-nothing architecture: all computations start in parallel and results are aggregated only at the end, which creates a bottleneck at the final node. SMP (Symmetric Multi-Processing) allows data exchange between nodes during computation, at the cost of some overhead. The author concludes that SMP is more suitable for OLAP computations.

4. TiSpark: Uses Spark for computation while loading TiKV data via gRPC. However, concerns about operational costs and data control led to its rejection.

5. Other Spark-based Solutions: Explored SnappyData (Spark + GemFire for HTAP) and CarbonData (Spark + HDFS with custom indexing), both demonstrating Spark's extensibility.
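The final-node bottleneck described in item 3 can be made concrete with a toy count of how many result rows each node has to receive. This is a sketch under stated assumptions, not a model of TiDB or Spark: the node count, the integer keys, and the `key % node_count` routing rule are all hypothetical.

```python
def gather_at_coordinator(node_outputs):
    """Aggregate-at-the-end model: every node sends all rows to node 0.

    Returns the number of rows received by each node.
    """
    received = [0] * len(node_outputs)
    received[0] = sum(len(rows) for rows in node_outputs)
    return received

def exchange_by_key(node_outputs):
    """Exchange model: each row goes to the node that owns its key.

    Ownership is assumed to be key % node_count for illustration.
    """
    node_count = len(node_outputs)
    received = [0] * node_count
    for rows in node_outputs:
        for key in rows:
            received[key % node_count] += 1
    return received

# Four hypothetical nodes, each producing 1000 distinct integer keys.
NODE_OUTPUTS = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]

if __name__ == "__main__":
    print(gather_at_coordinator(NODE_OUTPUTS))
    print(exchange_by_key(NODE_OUTPUTS))
```

In the first model the load on the final node grows with the total result size, while in the second it grows with the result size divided by the number of nodes, at the price of the extra network exchange the author mentions.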

Why Spark: The author chose Spark for its complete SQL computing capability (Spark SQL) and powerful DAG task scheduling for big data processing. They customized TGSpark to work with their TGMars storage, creating a custom file format to achieve storage-compute binding and meet their specific requirements.
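To give a rough idea of what DAG task scheduling means, the sketch below orders the stages of a query by their dependencies and groups them into waves that could run in parallel. It uses only Python's standard `graphlib` module and does not reproduce Spark's scheduler or TGSpark; the stage names and the `schedule_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical query stages; each maps to the set of stages it depends on.
STAGES = {
    "scan_logins": set(),
    "scan_payments": set(),
    "aggregate_logins": {"scan_logins"},
    "join": {"aggregate_logins", "scan_payments"},
    "final_sort": {"join"},
}

def schedule_waves(stages):
    """Group stages into waves whose dependencies are already complete."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages with no unmet dependencies
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unlock
    return waves

if __name__ == "__main__":
    for number, wave in enumerate(schedule_waves(STAGES), start=1):
        print(number, wave)
```

A multi-stage plan like this is what a single MapReduce pass cannot express: the join can only start once both the aggregation and the second scan have finished.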
