How a Spark Offline Framework Boosts Data Backtracking Efficiency
This article introduces a Spark offline development framework that separates configuration from code, supports SQL and Java applications, and provides fast, automated data backtracking with reduced environment preparation time, lower failure rates, and significant performance gains for large‑scale data warehouses.
Background
Apache Spark is the dominant engine for large‑scale data processing, offering multi‑language support and strong performance. Complex data analysis and frequent data‑backtracking tasks in data warehouses often require high manual effort and incur long execution times.
Framework Design
The offline framework is organized into three layers:
Base Framework : abstracts creation of SparkSession, SparkContext, and SparkConf, parses XML configuration files, and provides common UDFs.
Extensible Tools : utility classes for configuration parsing, database I/O, and date handling that are shared by the base framework and applications.
Applications : developers write only the data‑processing logic; the framework handles resource allocation, session management, and task scheduling.
Base Framework
All applications inherit from an Application parent class. Configuration is stored in an XML file that specifies the application class, input/output paths, and Spark resource settings such as spark.executor.memory, spark.executor.cores, spark.driver.memory, and spark.executor.instances. Example:
<?xml version="1.0" encoding="UTF-8"?>
<project name="test">
<class>com.way.app.instance.SqlExecutor</class>
<path>sql/file/path</path>
<conf>
<spark.executor.memory>1G</spark.executor.memory>
<spark.executor.cores>2</spark.executor.cores>
<spark.driver.memory>1G</spark.driver.memory>
<spark.executor.instances>20</spark.executor.instances>
</conf>
</project>Extensible Tools
The toolset includes classes for parsing the XML configuration, reading/writing databases, and handling date arithmetic and conversion. These utilities are used by both the base framework and the application layer.
Application Types
SQL Application
Developers provide only the SQL script and the resource configuration; the framework automatically creates the Spark session and executes the query. Example configuration:
<?xml version="1.0" encoding="UTF-8"?>
<project name="ecommerce_dwd_hanwuji_click_incr_day_domain">
<class>com.way.app.instance.ecommerce.Test</class>
<input>
<type>table</type>
<sql>select ... where event_day='{DATE}' and ...</sql>
</input>
<output>
<type>afs_kp</type>
<path>test/event_day={DATE}</path>
</output>
<conf>
<spark.executor.memory>2G</spark.executor.memory>
<spark.executor.cores>2</spark.executor.cores>
<spark.driver.memory>2G</spark.driver.memory>
<spark.executor.instances>10</spark.executor.instances>
</conf>
</project>Java Application
Developers create a class that extends Application and implement the run() method. The framework supplies the Spark session, input/output abstraction, and supports multiple input types (HDFS files, SQL tables, etc.). Example Java class:
package com.way.app.instance.ecommerce;
import com.way.app.Application;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import java.util.Map;
import org.apache.spark.sql.Dataset;
public class Test extends Application {
@Override
public void run() {
Map<String, String> input = (Map<String, String>) property.get("input");
Dataset<Row> ds = sparkSession.sql(getInput(input)).toDF("url", "num");
JavaRDD<String> outRdd = ds.filter(row -> {
String url = row.getAs("url").toString();
return url.contains(".jd.com") || url.contains(".suning.com") || url.contains(".taobao.com");
}).toJavaRDD().map(row -> row.mkString("\001"));
Map<String, String> output = (Map<String, String>) property.get("output");
outRdd.saveAsTextFile(getOutPut(output));
}
}Corresponding XML configuration:
<?xml version="1.0" encoding="UTF-8"?>
<project name="ecommerce_dwd_hanwuji_click_incr_day_domain">
<class>com.way.app.instance.ecommerce.Test</class>
<input>
<type>table</type>
<sql>select clk_url, clk_num from test_table where event_day='{DATE}' and click_pv > 0 and is_ubs_spam=0</sql>
</input>
<output>
<type>afs_kp</type>
<path>test/event_day={DATE}</path>
</output>
<conf>
<spark.executor.memory>2G</spark.executor.memory>
<spark.executor.cores>2</spark.executor.cores>
<spark.driver.memory>2G</spark.driver.memory>
<spark.executor.instances>10</spark.executor.instances>
</conf>
</project>Data Backtracking Application
This application automates large‑scale backtracking without code changes. It supports:
Breakpoint‑resume: successful sub‑tasks are recorded; failed tasks restart from the last successful date.
Configurable backtrack order (forward or reverse) via the order parameter.
Parallel backtracking using the distance parameter to define step size.
The framework consolidates environment preparation and release steps across all sub‑tasks, turning a per‑task overhead of 5‑20 minutes into a single preparation and a single release, dramatically reducing total runtime.
Performance test on a year‑long serial backtrack reduced execution time from approximately 2.5 days to about 6 hours (over 90 % efficiency gain) with minimal human monitoring.
Usage
Developers focus solely on the data‑processing logic. After packaging, the application is submitted with an XML configuration that includes: class: fully‑qualified application class (required). type: application type (default sql, or java for Java classes). path: path to the SQL script or Java class resources. limitdate and startdate: backtrack date range. order: 1 for forward, -1 for reverse order. distance: step size for parallel backtracking (‑1 for serial). file: log file that records successfully processed dates for breakpoint‑resume. conf: Spark resource settings (memory, cores, instances, etc.).
Example configuration for a backtrack job:
<?xml version="1.0" encoding="UTF-8"?>
<project name="ecommerce_ads_others_order_retain_incr_day">
<class>com.way.app.instance.ecommerce.Huisu</class>
<type>sql</type>
<path>/sql/ecommerce/ecommerce_ads_others_order_retain_incr_day.sql</path>
<limitdate>20220404</limitdate>
<startdate>20210101</startdate>
<order>1</order>
<distance>-1</distance>
<file>/user/ecommerce_ads_others_order_retain_incr_day/process</file>
<conf>
<spark.executor.memory>1G</spark.executor.memory>
<spark.executor.cores>2</spark.executor.cores>
<spark.executor.instances>30</spark.executor.instances>
<spark.yarn.maxAppAttempts>1</spark.yarn.maxAppAttempts>
</conf>
</project>Future Outlook
The current framework supports only SQL and Java applications. Future work will extend support to Scala, Python, and R, and improve multi‑language development capabilities to further simplify big‑data application building.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
