Big Data 10 min read

Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice

The article explains how Tencent’s TGSpark leverages Spark DataSource V2 to create a custom source for TGMars storage, detailing shard‑aware design, push‑down of columns and filters, columnar batch loading, partition‑location reporting, and experimental results that show reduced shuffles and improved local computation when executor placement matches storage nodes.

Tencent Cloud Developer
Tencent Cloud Developer
Tencent Cloud Developer
Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice

This article discusses how Tencent's TGSpark implements custom data sources using Spark's DataSource API V2 to load data from TGMars storage.

Why Custom Data Sources are Needed:

TGMars storage has unique characteristics: it maintains consistent data table shard rules within the same business, manages data file upload and distribution, ensures even distribution of shard data across storage nodes, and guarantees that the same shard data resides on the same group of machines. This design allows computation nodes to process data locally without shuffle operations, which is a significant optimization for data-intensive workloads.

Spark DataSource API V2 Features:

The article explains the key interfaces of DataSource API V2:

SupportsPushDownRequiredColumns: Pushes column selection down to the data source, allowing the data source to provide only necessary data

SupportsPushDownFilters: Pushes filter conditions to the data source for early data filtering

SupportsScanColumnarBatch: Provides columnar data loading without transposing columnar storage to row format

SupportsReportPartitioning: Reports data partitioning rules to Spark to eliminate unnecessary shuffle operations

InputPartition.preferredLocations: Reports partition locations (IPs) to enable local data loading

Experimental Results:

The article presents experiments demonstrating: column selection optimization where ScanV2 tasks only output selected columns in the physical plan; filter pushdown where WHERE conditions (e.g., platid=0 and level=10) are pushed to ScanV2 tasks, eliminating redundant Filter operations; and partition rule application where queries with primary key columns can skip shuffle operations for local computation.

Local Computation Considerations:

The article notes that for true local computation, the number of Executors should match the number of storage nodes, and Executors must be evenly distributed across nodes. Future work will explore elastic data computation binding mechanisms.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataSparkColumn PushdownCustom Data SourceDataSource API V2Filter PushdownLocal ComputationTGMars
Tencent Cloud Developer
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.