
Impala Architecture, Concurrency, CBO Join Optimization, and Storage Layer in Tencent Financial Big Data Scenario

This article explains how Tencent's financial big‑data platform uses Impala, detailing its overall architecture, concurrency mechanisms, cost‑based join optimization, storage layer design, and practical performance‑tuning experiences to achieve fast, interactive analytics.

DataFunSummit

Introduction

Tencent's financial business generates massive volumes of data every day. To support interactive analysis and agile decision-making, Impala was introduced as the core query engine. This article shares Impala's architecture and principles, along with deployment case studies, optimization work, and reflections.

Impala Architecture

Impala stores data in Kudu (real-time click-stream) and HDFS (bulk data). It supports interactive analysis, tag factories, user-profile analysis, and A/B testing. Key characteristics include resource allocation independent of YARN, long-running resident daemon processes, RPC-based data shuffle with batched streaming, and LLVM runtime code generation for high performance.
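As a concrete illustration of the dual storage layout, a Kudu-backed table for real-time click-stream data can be declared directly from Impala (the table and column names here are hypothetical; bulk historical data would instead sit in Parquet tables on HDFS):

```sql
-- Hypothetical real-time click-stream table backed by Kudu.
CREATE TABLE clickstream_rt (
  event_id BIGINT,
  user_id  BIGINT,
  ts       TIMESTAMP,
  PRIMARY KEY (event_id)
)
PARTITION BY HASH (event_id) PARTITIONS 16
STORED AS KUDU;
```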

Concurrency Principles

1. Thread hierarchy – instance (execution) threads, scan (decompression) threads, and I/O threads. Impala's default mode separates compute threads from scan threads, while the multi-thread mode merges scanning into the instance threads.

2. Two high-concurrency issues: ① severe latency jitter caused by RPC-related EXCHANGE queues filling up, resolved by raising datastream_service_num_deserialization_threads from its default to 80 on machines with 96 CPU hyper-threads; ② a cast-to-string pattern in queries that degrades concurrency.

3. Concurrency constraints – scan scheduling is file-granular: a small table stored as a few large files yields low parallelism, while splitting it into many small files raises parallelism at the cost of I/O efficiency; both extremes limit performance.
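The three-level thread hierarchy described in point 1 can be sketched as a toy pipeline (an illustrative Python model, not Impala's actual C++ implementation): I/O threads fetch compressed blocks, scan threads decompress them into rows, and an instance thread consumes the rows for plan execution.

```python
import queue
import threading
import zlib

def run_pipeline(raw_rows):
    """Toy model of Impala's default mode: I/O, scan, and instance
    work run on separate threads connected by queues."""
    io_q, scan_q = queue.Queue(), queue.Queue()
    results = []

    def io_thread():
        # I/O thread: read compressed blocks from "disk".
        for row in raw_rows:
            io_q.put(zlib.compress(row.encode()))
        io_q.put(None)  # end-of-stream marker

    def scan_thread():
        # Scan thread: decompress blocks into row data.
        while (block := io_q.get()) is not None:
            scan_q.put(zlib.decompress(block).decode())
        scan_q.put(None)

    def instance_thread():
        # Instance thread: execute the plan fragment (here: collect rows).
        while (row := scan_q.get()) is not None:
            results.append(row)

    threads = [threading.Thread(target=t)
               for t in (io_thread, scan_thread, instance_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In the multi-thread mode described above, the scan stage would collapse into the instance threads rather than running as its own pool.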

Cost‑Based Join (CBO) Optimization

1. Basic workflow – collect statistics (row counts, selected columns, join type), decide between broadcast and partitioned joins, determine the join order, and choose which side is treated as the large table.

2. Three typical problems and solutions: ① outer-join data skew – use a "pivot-table" strategy and control the join order via straight_join; ② inconsistent GROUP BY ordering across sub-queries – enforce a uniform grouping order to eliminate unnecessary exchanges; ③ mis-estimated cardinalities under the uniform-distribution assumption – avoid unwanted broadcast joins.

3. Broad applications – virtual cube for cross‑topic analysis, reducing the need for massive user‑profile packages and improving efficiency of multi‑table joins.
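The broadcast-versus-partitioned decision in point 1 can be reduced to a drastically simplified cost comparison (an illustrative model only, not Impala's actual planner formulas): broadcasting copies the build side to every node, while a partitioned join hash-shuffles each row of both sides once.

```python
def choose_join_strategy(left_rows, right_rows, num_nodes):
    """Toy cost model for the planner decision sketched above.

    broadcast: ship the full build (smaller) side to every node.
    partitioned: hash-shuffle each row of both sides exactly once.
    """
    build_rows = min(left_rows, right_rows)   # smaller side builds the hash table
    broadcast_cost = build_rows * num_nodes   # build side copied to each node
    partitioned_cost = left_rows + right_rows # each row shipped once
    return "broadcast" if broadcast_cost < partitioned_cost else "partitioned"
```

When the cardinality estimate feeding such a model is wrong (problem ③ above), the straight_join hint lets the query author force the written join order instead.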

Storage Layer

1. Principles – pre‑computation or on‑the‑fly indexing, columnar storage with statistics to reduce I/O. Columnar layout enables reading only needed columns, minimizing disk seeks.

2. Data filtering – three‑step filter using row‑group statistics, page index, and dictionary encoding (e.g., PLAIN_DICTIONARY) to prune data early.

3. Optimizations – global sorting, hash‑partitioned sorting, and Z‑order to improve filter efficiency.
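The row-group statistics step of the three-step filter above can be sketched as follows (a toy model: a real Parquet reader consults footer metadata, the page index, and dictionaries instead of scanning raw values):

```python
def prune_row_groups(row_groups, predicate_min, predicate_max):
    """Skip any row group whose min/max statistics prove that no row
    can fall inside the predicate's value range."""
    survivors = []
    for rg in row_groups:
        lo, hi = min(rg), max(rg)
        if hi < predicate_min or lo > predicate_max:
            continue  # statistics guarantee zero matches; never decode
        survivors.append(rg)
    return survivors
```

Page-index and dictionary filtering apply the same idea at progressively finer granularity before any value is actually decoded.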

Practical impact – user-profile analysis time dropped from 20 minutes to 1 minute across 18 million users, making user-profile queries truly interactive.
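The Z-order optimization listed in the storage section interleaves the bits of several sort columns so that rows close in any one column cluster together on disk, letting min/max pruning work for more than one filter column. A minimal two-column sketch:

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of two non-negative coordinates into a
    single Z-order (Morton) key; sorting by this key keeps rows that
    are near each other in either column physically close."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return key
```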

Summary and Reflections

1. OLAP engine performance‑optimization roadmap – vectorization, dynamic code generation, and reducing type‑branching.

2. Impala‑specific optimization ideas – problem‑driven (basic & advanced) and metric‑driven (stress testing CPU, I/O, network) approaches.
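The vectorization and branch-reduction ideas in the roadmap above can be illustrated by hoisting a per-row type branch out of the hot loop (a toy Python example; real engines do this over columnar batches in generated native code):

```python
def sum_column_scalar(values, is_float):
    # Row-at-a-time: the type branch is re-evaluated for every row.
    total = 0.0
    for v in values:
        if is_float:
            total += float(v)
        else:
            total += int(v)
    return total

def sum_column_batched(values, is_float):
    # Batch-at-a-time: resolve the type branch once, then run a tight
    # loop -- the same principle behind vectorization and codegen.
    conv = float if is_float else int
    return sum(conv(v) for v in values)
```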

Overall, after understanding Impala’s architecture, processing logic, and storage theory, readers gain a deeper insight into building high‑performance interactive analytics on big‑data platforms.

Tags: Big Data, Concurrency, Query Optimization, OLAP, Tencent, Impala, Storage Layer
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
