
Scaling TB‑Level Price Computations with Apache Spark: Suning’s Architecture and Optimizations

This article details how Suning built a Hadoop‑based big data platform and leveraged Apache Spark to process terabytes of product price and inventory data, describing the system architecture, four key technical practices, performance results, and future data‑lake directions.

dbaplus Community

Suning Big Data Platform Overview

In 2013 Suning built a Hadoop‑based big data platform that provides storage and compute for all business units. For the retail center's supply‑chain calculations, the team later adopted Apache Spark to handle large‑scale data processing, both offline and online.

Price Computation System Using Spark

The system integrates data from multiple upstream sources (DB2, MySQL) to generate price and inventory information for all sellable products. The workflow has four stages: extract the full data set with Spark, join and aggregate it into analytical dimensions, apply Spark map transformations, and persist the results to HDFS and Hive external tables.

Key Technical Practices

1. DataFrame‑based Massive Data Extraction

Data is pulled directly from source databases via SparkSQL’s JDBC interface, loading up to billions of rows into ~1000 DataFrames. This approach is lighter than Sqoop, reduces scheduling overhead, and allows dynamic table selection within Spark code.

Optimizations include schema‑aware loading and caching, reducing DataFrame load time from ~30 minutes to under 5 minutes.
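
As a rough illustration of the JDBC path described above, the sketch below builds the option map for a partitioned Spark JDBC read and picks tables dynamically in code. It is plain Python so it runs without a cluster. The helper jdbc_read_options, the URL, the table names, and the bounds are all hypothetical, and the sketch does not reproduce the schema‑aware loading or caching optimization, whose details the article does not give. The option keys themselves (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions, fetchsize) are standard Spark JDBC data source options.

```python
def jdbc_read_options(url, table, partition_column, lower, upper,
                      num_partitions, fetch_size=10000):
    """Build the options for one partitioned Spark JDBC read (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    # Spark expects option values as strings.
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


# Dynamic table selection: the table list is ordinary data in the program.
# Table names and URL are placeholders; credentials are deliberately omitted.
tables = ["price_base", "inventory"]
options_by_table = {
    name: jdbc_read_options(
        "jdbc:mysql://db.example.internal:3306/retail",
        name, "id", 1, 1_000_000, 8,
    )
    for name in tables
}

# With PySpark available, each entry could feed a DataFrame read, e.g.:
#   df = spark.read.format("jdbc").options(**options_by_table["price_base"]).load()
```

Keeping the table list as data is what makes the selection dynamic: adding a source table means adding a name, not a new scheduled import job.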

Figure: DataFrame creation before optimization
Figure: DataFrame creation after optimization

2. Multi‑Dimensional Joins with SparkSQL & ZipPartition

To handle hierarchical dimensions (national, regional, city), the team first tried sequential left joins with fallback: rows left unmatched by one dimension were filtered out and joined against the next, and the per‑level results were unioned at the end. This caused repeated data scans:

DataSet LeftJoin DimensionA => DataSetA
DataSetA Filter(A.Field == NULL) => DataSetToJoinB
DataSetA Filter(A.Field != NULL) => DataSetAFinal
DataSetToJoinB LeftJoin DimensionB => DataSetB
...
DataSetFinal = DataSetAFinal UNION DataSetBFinal UNION DataSetCFinal

With ZipPartition, they instead performed a single left join that tags each match with a priority flag, followed by a GroupBy that resolves the priorities, which also minimized cache usage.
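
The single‑pass idea (one join that tags matches with a priority, then a group‑by that keeps the best match) can be modelled in a few lines. This is a plain‑Python sketch of the logic only, not Spark or ZipPartition code. The record layout, the function resolve_prices, the sample data, and the fallback order (city before region before national) are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed priority order: a lower number is more specific and wins.
CITY, REGION, NATIONAL = 1, 2, 3


def resolve_prices(products, rules):
    """products: dicts with sku, city, region.
    rules: (priority, scope, sku, price) tuples from all dimension levels."""
    # One combined dimension, tagged with its priority level.
    index = {(prio, scope, sku): price for prio, scope, sku, price in rules}

    # Join step: a single pass over the products collects every matching level.
    candidates = defaultdict(list)
    for row in products:
        matches = candidates[(row["sku"], row["city"])]  # left join: key always kept
        levels = ((CITY, row["city"]), (REGION, row["region"]), (NATIONAL, "*"))
        for prio, scope in levels:
            price = index.get((prio, scope, row["sku"]))
            if price is not None:
                matches.append((prio, price))

    # Group-by step: keep the match with the smallest priority number, if any.
    return {
        key: min(found, key=lambda m: m[0])[1] if found else None
        for key, found in candidates.items()
    }


products = [
    {"sku": "A", "city": "Nanjing", "region": "East"},
    {"sku": "B", "city": "Nanjing", "region": "East"},
    {"sku": "C", "city": "Beijing", "region": "North"},
    {"sku": "D", "city": "Beijing", "region": "North"},
]
rules = [
    (CITY, "Nanjing", "A", 95.0),
    (REGION, "East", "A", 98.0),
    (NATIONAL, "*", "A", 100.0),
    (REGION, "East", "B", 48.0),
    (NATIONAL, "*", "B", 50.0),
    (NATIONAL, "*", "C", 20.0),
]
resolved = resolve_prices(products, rules)
```

Here SKU A takes the city price, B falls back to the regional price, C to the national price, and D stays unmatched (None). The products are scanned once, rather than once per dimension level.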

3. Driver‑Side Parallel Loading

When processing ~20 billion rows, a single‑core load step took five minutes, violating strict latency SLAs. The solution caches the heavy tables, forces an eager count() to materialize them, and blocks the driver until loading completes, then reuses the cached data.
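
The article does not spell out how the loads were issued. One plausible reading of "driver‑side parallel loading" is that several driver threads each trigger a cache‑and‑count job while the main thread waits for all of them, which Spark's thread‑safe scheduler permits. The sketch below models that pattern in plain Python: materialize stands in for cache() followed by count(), and slow_table is a hypothetical loader that simulates latency with a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def materialize(name, load_fn):
    """Stand-in for df.cache() followed by an eager df.count()."""
    rows = load_fn()
    return name, rows, len(rows)


def load_all(loaders, max_workers=4):
    """Start every load from its own thread and block until all have finished."""
    cache = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(materialize, n, fn) for n, fn in loaders.items()]
        for future in futures:
            # result() blocks, and re-raises any error from the worker thread.
            name, rows, _count = future.result()
            cache[name] = rows
    return cache


def slow_table(size, delay=0.2):
    """Hypothetical loader that simulates a slow table with a sleep."""
    def load():
        time.sleep(delay)
        return list(range(size))
    return load


loaders = {
    "price": slow_table(3),
    "inventory": slow_table(5),
    "promo": slow_table(2),
}
start = time.perf_counter()
cached = load_all(loaders)
elapsed = time.perf_counter() - start  # roughly one delay, not three, on an idle machine
```

Later stages read from cached instead of reloading. In a real Spark job the threads would only submit work; the rows would stay in executor memory rather than in a driver‑side dictionary.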

Figure: Parallel loading diagram

4. ClassLoader Issues and Token‑Based JDBC

Because production database credentials cannot be packaged with the application, the team wrapped JDBC access in a token‑based library loaded via a custom ClassLoader. In the cluster, the driver ran under AppClassLoader, which could not read the HTTP‑based JAR tokens, so the team swapped the thread's ClassLoader for the one that loaded the business code.

Practical Takeaways

Loading massive tables directly with SparkSQL JDBC is feasible but requires careful DataFrame creation and driver‑side optimizations.

ZipPartition provides a powerful way to resolve multi‑level dimension joins when standard SparkSQL joins are insufficient.

Driver‑side parallel materialization can shave minutes off critical paths, provided the data size is manageable.

Custom ClassLoader handling is essential when configuration files are packaged inside JARs.

Future Direction

Suning plans to evolve toward a DataLake + Big Data Warehousing architecture, continuing to build an integrated Spark‑Storm processing platform that supports real‑time data services, governance, and analytics.

