Big Data 10 min read

Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice

The article explains how Tencent’s TGSpark leverages Spark DataSource V2 to create a custom source for TGMars storage, detailing shard‑aware design, push‑down of columns and filters, columnar batch loading, partition‑location reporting, and experimental results that show reduced shuffles and improved local computation when executor placement matches storage nodes.

Tencent Cloud Developer

Jul 24, 2019

Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice

This article discusses how Tencent's TGSpark implements custom data sources using Spark's DataSource API V2 to load data from TGMars storage.

Why Custom Data Sources are Needed:

TGMars storage has unique characteristics: it maintains consistent data table shard rules within the same business, manages data file upload and distribution, ensures even distribution of shard data across storage nodes, and guarantees that the same shard data resides on the same group of machines. This design allows computation nodes to process data locally without shuffle operations, which is a significant optimization for data-intensive workloads.

Spark DataSource API V2 Features:

The article explains the key interfaces of DataSource API V2:

SupportsPushDownRequiredColumns: Pushes column selection down to the data source, allowing the data source to provide only necessary data

SupportsPushDownFilters: Pushes filter conditions to the data source for early data filtering

SupportsScanColumnarBatch: Provides columnar data loading without transposing columnar storage to row format

SupportsReportPartitioning: Reports data partitioning rules to Spark to eliminate unnecessary shuffle operations

InputPartition.preferredLocations: Reports partition locations (IPs) to enable local data loading

Experimental Results:

The article presents experiments demonstrating: column selection optimization where ScanV2 tasks only output selected columns in the physical plan; filter pushdown where WHERE conditions (e.g., platid=0 and level=10) are pushed to ScanV2 tasks, eliminating redundant Filter operations; and partition rule application where queries with primary key columns can skip shuffle operations for local computation.

Local Computation Considerations:

The article notes that for true local computation, the number of Executors should match the number of storage nodes, and Executors must be evenly distributed across nodes. Future work will explore elastic data computation binding mechanisms.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Spark Column Pushdown Custom Data Source DataSource API V2 Filter Pushdown Local Computation TGMars

Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.