How TDSQL‑C Accelerates Parallel Queries in Cloud‑Native Databases
This transcript details the design, implementation, and performance evaluation of TDSQL‑C's parallel query feature, covering product background, workload challenges, execution plans, Amdahl's law analysis, task scheduling, data partitioning, and future enhancements for cloud‑native database systems.
Reference Materials
TDSQL‑C product page: https://cloud.tencent.com/product/cynosdb
Column‑store vs. row‑store study: Abadi et al., 2008
TDSQL‑C parallel query user manual: https://cloud.tencent.com/document/product/1003/81872
TPC‑H benchmark specification: https://www.tpc.org/tpch/default5.asp
Amdahl's law: https://en.wikipedia.org/wiki/Amdahl%27s_law
SQL computation model: Astrahan & Chamberlin, 1975
Volcano parallel query system: Graefe, 1994
Immediate‑trigger parallel scheduling: Graefe, 1990
HASH partitioning: Kitsuregawa et al., 1989
Decomposable aggregate functions: https://en.wikipedia.org/wiki/Aggregate_function
Control vs. data flow in parallel DBMS: Teeuw & Blanken, 1993
CSP interaction model: https://en.wikipedia.org/wiki/Communicating_sequential_processes
Product Overview
TDSQL‑C is a cloud‑native relational database built on physical log replication and shared storage. It provides near‑100% MySQL compatibility, high‑availability clustering, and elastic scaling. The architecture separates compute and storage, allowing cost‑effective expansion of CPU cores without moving data.
Parallel Query Feature
The parallel query engine introduces *query‑level concurrency*: a single SQL statement is split into multiple tasks that run on different CPU cores. Users first allocate a core budget (e.g., 4 cores) for a session; the optimizer then generates a parallel execution plan containing parallel‑capable operators and special exchange operators for data movement between threads.
Performance was measured with the TPC‑H benchmark (22 queries representing typical analytical workloads). Speed‑up varies per query and follows Amdahl's law: overall acceleration = 1 / ((1‑p) + p/s), where p is the parallelizable fraction and s is the local speed‑up. Most queries showed noticeable gains, while queries with a small p (e.g., heavy single‑threaded aggregation) exhibited limited improvement.
Parallelism Principles
Parallel execution is achieved by decomposing both tasks and data :
**Task decomposition** – the optimizer inserts iterator boundaries (e.g., after a filter or join) and treats each sub‑iterator as an independent task.
**Data decomposition** – input tables are partitioned (hash or range) so that each task works on a disjoint subset.
Example: a SELECT SUM(col) FROM tbl aggregation. In a serial plan, each row updates a global hash table, then a final pass emits the sum. In the parallel plan, the table is split into two halves, each half builds a local hash table in parallel, and a final merge operator combines the two partial sums. Because the final merge is sequential, the speed‑up is less than 2×.
The SQL computation model consists of three layers:
Relational operators (filters, joins, scans)
Statistical functions (e.g., SUM, COUNT) that require a full‑set scan and therefore introduce a pipeline break.
Expression evaluation (row‑wise arithmetic, user‑defined functions)
Hash‑based partitioning enables parallel joins: the build side is hashed on the join key, and each probe thread only receives matching partitions, reducing data shuffling.
Scheduling Framework
A dedicated scheduling framework coordinates task execution:
The optimizer produces a task graph that captures dependencies between parallel tasks.
A coordinator thread (which may be a dedicated thread or the user thread) dispatches tasks to worker threads.
Worker threads execute their assigned tasks, report status, and participate in error handling integrated with MySQL’s exception mechanism.
Task representation follows a “task shape” model that reuses the optimizer‑generated physical plan. The shape is intercepted, cached, and reused for subsequent executions, avoiding redundant plan construction. When a parallel‑capable operator is detected, the framework:
Defines task boundaries at iterator edges.
Partitions input data according to the operator’s requirements.
Allocates the requested core resources.
Data Partitioning and Exchange
InnoDB stores tables as B+‑trees. The partitioner recursively splits index ranges until each partition contains enough rows for a worker thread. If a partition is still too large, it is further subdivided at a lower tree level. This logic is based on the existing community code and extended to support both forward and reverse scans and multiple index types.
Data exchange between upstream and downstream tasks is performed by *bridge* operators. Each bridge acts as a network node that receives rows from its producer thread, buffers them, and forwards them to the consumer thread(s). This design decouples the execution of independent tasks while preserving the logical data flow of the original plan.
Future Directions
Planned enhancements include:
Extending parallel query coverage to more operator types (e.g., window functions, sub‑queries).
Deep performance tuning of partitioning heuristics and exchange buffering.
Hybrid row‑column scheduling that combines vectorized columnar execution with row‑based processing for additive speed‑ups.
Improving visibility into the execution engine (trace logs, visual plan explorers) to aid debugging and performance tuning.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
