Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice
This article explains how Tencent's TGSpark uses Spark DataSource V2 to build a custom source for TGMars storage. It covers the shard-aware design, push-down of columns and filters, columnar batch loading, and partition-location reporting, along with experimental results that show fewer shuffles and more local computation when executor placement matches the storage nodes.
This article discusses how Tencent's TGSpark implements custom data sources using Spark's DataSource API V2 to load data from TGMars storage.
Why Custom Data Sources are Needed:
TGMars storage has several distinctive characteristics: it applies the same sharding rule to every data table within a given business, manages the upload and distribution of data files, spreads shard data evenly across storage nodes, and guarantees that data belonging to the same shard resides on the same group of machines. Because related data is already co-located, computation nodes can process it locally without shuffle operations, which is a significant optimization for data-intensive workloads.
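The co-location idea can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions: the source does not describe the actual TGMars shard rule, so the hash-modulo rule, the shard count, the machine-group names, and the sample tables below are all hypothetical.

```python
import zlib

NUM_SHARDS = 8  # hypothetical shard count
NODE_GROUPS = ["group-a", "group-b", "group-c", "group-d"]  # hypothetical machine groups


def shard_of(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Shard rule shared by every table of one business (assumed: stable hash modulo)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def node_group_of(shard: int) -> str:
    """Fixed shard-to-machine-group placement (assumed: round-robin)."""
    return NODE_GROUPS[shard % len(NODE_GROUPS)]


def place(rows, key_field):
    """Group rows by the machine group that stores their shard."""
    placement = {}
    for row in rows:
        group = node_group_of(shard_of(row[key_field]))
        placement.setdefault(group, []).append(row)
    return placement


# Two hypothetical tables that share the key column "uid".
logins = [{"uid": "u1", "platid": 0}, {"uid": "u2", "platid": 1}]
payments = [{"uid": "u1", "amount": 30}, {"uid": "u2", "amount": 5}]

login_placement = place(logins, "uid")
payment_placement = place(payments, "uid")
```

Because both tables use the same rule on the same key, every `uid` in `login_placement` lands in the same machine group as its rows in `payment_placement`, so a join on `uid` would not need to move data between groups.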
Spark DataSource API V2 Features:
The article explains the key interfaces and methods of DataSource API V2:
SupportsPushDownRequiredColumns: Pushes column selection down to the data source, allowing the data source to provide only necessary data
SupportsPushDownFilters: Pushes filter conditions to the data source for early data filtering
SupportsScanColumnarBatch: Provides columnar data loading without transposing columnar storage to row format
SupportsReportPartitioning: Reports data partitioning rules to Spark to eliminate unnecessary shuffle operations
InputPartition.preferredLocations: Reports partition locations (IPs) to enable local data loading
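The first two items, column pruning and filter push-down, can be modeled in a few lines of plain Python. This is a conceptual sketch, not the Spark API: `ToyReader`, `EqualTo`, `Contains`, and the sample rows are hypothetical, and the method names only loosely mirror the Java interfaces. The one behavior borrowed from Spark is that pushing filters returns the filters the source cannot handle, which the engine must still evaluate itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EqualTo:
    """Equality predicate such as platid = 0."""
    column: str
    value: object


@dataclass(frozen=True)
class Contains:
    """Substring predicate that this toy source does not support."""
    column: str
    value: str


@dataclass
class ToyReader:
    rows: list
    columns: list
    pushed: list = field(default_factory=list)

    def prune_columns(self, required):
        # Keep only the columns the query needs, in the requested order.
        self.columns = [c for c in required if c in self.columns]

    def push_filters(self, filters):
        # Accept the predicates the source can evaluate; hand the rest back.
        self.pushed = [f for f in filters if isinstance(f, EqualTo)]
        return [f for f in filters if not isinstance(f, EqualTo)]

    def scan(self):
        # Filter on the stored row first, then project the pruned columns.
        for row in self.rows:
            if all(row[f.column] == f.value for f in self.pushed):
                yield {c: row[c] for c in self.columns}


reader = ToyReader(
    rows=[
        {"uid": "u1", "platid": 0, "level": 10, "name": "ann"},
        {"uid": "u2", "platid": 1, "level": 10, "name": "bob"},
        {"uid": "u3", "platid": 0, "level": 3, "name": "cy"},
    ],
    columns=["uid", "platid", "level", "name"],
)
reader.prune_columns(["uid", "level"])
residual = reader.push_filters(
    [EqualTo("platid", 0), EqualTo("level", 10), Contains("name", "a")]
)
result = list(reader.scan())
```

Here `result` contains only the `uid` and `level` values of the single row matching both equality predicates, while `residual` holds the unsupported `Contains` predicate that the engine would still need to apply after the scan.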
Experimental Results:
The article presents experiments that demonstrate three effects. Column selection: ScanV2 tasks output only the selected columns in the physical plan. Filter pushdown: WHERE conditions (e.g., platid=0 and level=10) are pushed to the ScanV2 tasks, which eliminates the redundant Filter operations. Partition rule application: queries that involve the primary key columns can skip shuffle operations and compute locally.
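The shuffle-skipping result follows from a simple containment argument, sketched here in plain Python. The class and function names are hypothetical and this is not Spark's planner code; it only models the rule that a source partitioned by its key columns already satisfies a clustering requirement whenever those key columns are all among the required columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedPartitioning:
    """Partitioning a source reports to the engine (hypothetical model)."""
    key_columns: tuple
    num_partitions: int

    def satisfies_clustering(self, clustered_columns) -> bool:
        # Rows that agree on every required column also agree on the key
        # columns when the key columns are a subset of the required ones,
        # so such rows already sit in the same partition.
        return bool(self.key_columns) and set(self.key_columns) <= set(clustered_columns)


def needs_shuffle(partitioning, group_by_columns) -> bool:
    """A shuffle is needed only if the reported partitioning does not help."""
    return not partitioning.satisfies_clustering(group_by_columns)


# Hypothetical table whose primary key column is "uid".
table = ReportedPartitioning(key_columns=("uid",), num_partitions=8)

by_key = needs_shuffle(table, ["uid"])
by_key_and_more = needs_shuffle(table, ["uid", "platid"])
by_other = needs_shuffle(table, ["level"])
```

In this model, grouping by `uid` or by `uid, platid` needs no shuffle, while grouping by `level` alone does, because rows with the same `level` may live in different partitions.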
Local Computation Considerations:
The article notes that for true local computation, the number of Executors should match the number of storage nodes, and the Executors must be evenly distributed across those nodes. Future work will explore mechanisms for elastically binding computation to data.
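A placement check along these lines can be sketched in plain Python. The function names, report fields, and IP addresses are hypothetical, and the sketch reads "evenly distributed" as exactly one Executor per storage node, which is an assumption rather than something the article spells out.

```python
from collections import Counter


def placement_report(storage_nodes, executor_hosts):
    """Compare where Executors run with where the data lives."""
    per_node = Counter(executor_hosts)
    nodes = set(storage_nodes)
    return {
        # Storage nodes with no Executor: their shards must be read remotely.
        "uncovered": sorted(n for n in nodes if per_node[n] == 0),
        # Executors on hosts that store no data: they can never read locally.
        "off_storage": sorted(h for h in per_node if h not in nodes),
        # Assumed target layout: exactly one Executor on each storage node.
        "one_per_node": len(executor_hosts) == len(nodes)
        and all(per_node[n] == 1 for n in nodes),
    }


def can_run_locally(preferred_locations, executor_hosts):
    """True if some Executor runs on one of the partition's preferred hosts."""
    return any(host in executor_hosts for host in preferred_locations)


storage_nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical IPs
matched = placement_report(storage_nodes, ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
skewed = placement_report(storage_nodes, ["10.0.0.1", "10.0.0.1", "10.0.0.2"])
```

In the skewed layout the Executor count still equals the node count, yet `10.0.0.3` has no Executor, so partitions that prefer that host cannot be loaded locally.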
Tencent Cloud Developer
