Practical Guide to Building an Advertising Project with Spark and Kudu

This article provides a step‑by‑step tutorial on deploying a Spark‑based advertising data pipeline using Kudu, covering Hadoop setup, HDFS data loading, Spark application refactoring, Maven packaging, Yarn execution, and crontab scheduling for daily automated runs.

Big Data Technology & Architecture

Aug 21, 2020

Goal: Package and run the Spark + Kudu advertising project on a server.

1. Put data on HDFS

Start Hadoop services and create a daily directory in HDFS (YYYYMMDD format), then upload the JSON data and IP rule file.

[hadoop@hadoop000 ~]$ cd app/
[hadoop@hadoop000 app]$ ls
apache-maven-3.6.3      hive-1.1.0-cdh5.15.1  spark-2.4.5-bin-hadoop2.6
hadoop-2.6.0-cdh5.15.1  jdk1.8.0_91           tmp
[hadoop@hadoop000 app]$ cd hadoop-2.6.0-cdh5.15.1/
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ ls
bin  etc  include  LICENSE.txt  README.txt  src
bin-mapreduce1  examples  lib  logs  sbin
cloudera  examples-mapreduce1  libexec  NOTICE.txt  share
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ cd sbin/
[hadoop@hadoop000 sbin]$ ls
distribute-exclude.sh    slaves.sh            stop-all.sh
hadoop-daemon.sh        start-all.cmd        start-all.sh
stop-balancer.sh        hdfs-config.cmd      start-balancer.sh
stop-dfs.cmd            hdfs-config.sh       start-dfs.cmd
stop-dfs.sh             hdfs-daemons.sh      start-dfs.sh
stop-secure-dns.sh      httpfs.sh            start-secure-dns.sh
stop-yarn.cmd           kms.sh               start-yarn.cmd
stop-yarn.sh            yarn-daemon.sh       mr-jobhistory-daemon.sh
[hadoop@hadoop000 sbin]$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [hadoop000]
... (log output omitted for brevity) ...

Create the daily directory and upload files:

[hadoop@hadoop000 sbin]$ hadoop fs -mkdir -p /tai/access/20181007
[hadoop@hadoop000 sbin]$ hadoop fs -put ~/data/data-test.json /tai/access/20181007/
[hadoop@hadoop000 sbin]$ hadoop fs -put ~/data/ip.txt /tai/access/

Verify the upload via the HDFS web UI (port 50070).
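The upload steps above can be wrapped in a small helper so that a malformed date never creates a stray directory. This is only a sketch: the function names hdfs_day_dir and upload_day and the DRY_RUN switch are made up for illustration, while the hadoop fs commands are the ones used above.

```shell
# Hypothetical helper: print the daily HDFS directory for a YYYYMMDD batch date.
# Only the shape (eight digits) is checked, not whether the date exists.
hdfs_day_dir() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
      echo "/tai/access/$1" ;;
    *)
      echo "invalid batch date: '$1' (expected YYYYMMDD)" >&2
      return 1 ;;
  esac
}

# Hypothetical helper: create the daily directory and upload one JSON file.
# With DRY_RUN set to a non-empty value the commands are printed, not executed.
upload_day() {
  day_dir="$(hdfs_day_dir "$1")" || return 1
  ${DRY_RUN:+echo} hadoop fs -mkdir -p "$day_dir" &&
  ${DRY_RUN:+echo} hadoop fs -put "$2" "$day_dir/"
}
```

Running DRY_RUN=1 upload_day 20181007 ~/data/data-test.json first shows the two commands without touching HDFS; dropping DRY_RUN performs the upload.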

2. Refactor for scheduled operation

Modify the Spark application to accept runtime parameters (spark.time, spark.raw.path, spark.ip.path) and to generate table names based on the processing date.

package com.imooc.bigdata.cp08

import com.imooc.bigdata.cp08.business.{AppStatProcessor, AreaStatProcessor, LogETLProcessor, ProvinceCityStatProcessor}
import org.apache.commons.lang3.StringUtils
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession

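// Entry point for the whole project
object SparkApp extends Logging{
  def main(args: Array[String]): Unit = {
    // The master is not hard-coded: a value set in code takes precedence over
    // spark-submit's --master flag, so --master yarn would be ignored
    val spark = SparkSession.builder()
      .appName("SparkApp")
      .getOrCreate()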
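    // spark-submit ... --conf spark.time=20181007
    // spark-submit ignores --conf keys that do not start with "spark."
    // The default value lets a missing key reach the check below instead of
    // throwing NoSuchElementException
    val time = spark.sparkContext.getConf.get("spark.time", "")
    if(StringUtils.isBlank(time)){
      logError("The processing batch (spark.time) must not be empty")
      spark.stop()
      System.exit(1)
    }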
    LogETLProcessor.process(spark)
    ProvinceCityStatProcessor.process(spark)
    AreaStatProcessor.process(spark)
    AppStatProcessor.process(spark)
    spark.stop()
  }
}

Utility for generating table names:

package com.imooc.bigdata.cp08.utils

import org.apache.spark.sql.SparkSession

object DateUtils {
  def getTableName(tableName:String,spark:SparkSession)={
    val time = spark.sparkContext.getConf.get("spark.time")
    tableName + "_" + time
  }
}

Example usage in processors:

val sourceTableName = DateUtils.getTableName("ods",spark)
val sinkTableName = DateUtils.getTableName("province_city_stat",spark)

Define input paths:

val rawPath = spark.sparkContext.getConf.get("spark.raw.path")
var jsonDF = spark.read.json(rawPath)
val ipRulePath = spark.sparkContext.getConf.get("spark.ip.path")
val ipRowRDD = spark.sparkContext.textFile(ipRulePath)

Run the application with spark-submit, passing the three parameters (spark.time, spark.raw.path, spark.ip.path) as --conf settings.
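Since all three settings derive from one batch date, they can be generated in one place and reused by every submit script. The following is a sketch under assumptions: spark_conf_args and HDFS_BASE are hypothetical names, and the default address is the NameNode host and port used in this walkthrough.

```shell
# Hypothetical setting: base HDFS location of the access logs.
# Defaults to the address used in this walkthrough; override it via the environment.
HDFS_BASE="${HDFS_BASE:-hdfs://hadoop000:8020/tai/access}"

# Hypothetical helper: print the three --conf settings for one batch date,
# one token per line so the output can be word-split onto a command line.
spark_conf_args() {
  printf '%s\n' \
    "--conf" "spark.time=$1" \
    "--conf" "spark.raw.path=$HDFS_BASE/$1" \
    "--conf" "spark.ip.path=$HDFS_BASE/ip.txt"
}
```

A submit script could then pass $(spark_conf_args 20181007) unquoted to spark-submit; this works because none of the values contain spaces.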

3. Package the code

Use Maven to build a JAR, then copy the JAR together with kudu-client-1.7.0.jar and kudu-spark2_2.11-1.7.0.jar to ~/lib on the server.
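Before submitting, it can help to confirm that all three JARs actually reached the server. The check below is a minimal sketch; missing_jars is a hypothetical helper, and the application JAR name sparksql-train-1.0.jar is taken from the submit scripts in this article.

```shell
# Hypothetical helper: print the name of each required JAR that is missing
# from the given directory (prints nothing when all three are present).
missing_jars() {
  lib_dir="$1"
  for jar in sparksql-train-1.0.jar kudu-client-1.7.0.jar kudu-spark2_2.11-1.7.0.jar; do
    [ -f "$lib_dir/$jar" ] || echo "$jar"
  done
}
```

For example, missing_jars /home/hadoop/lib prints nothing once the copy is complete.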

Start Spark services:

cd ~/app/spark-2.4.5-bin-hadoop2.6/sbin/
sh start-all.sh

Create a local test script job.sh:

time=20181007
${SPARK_HOME}/bin/spark-submit \
  --class com.imooc.bigdata.cp08.SparkApp \
  --master local \
  --jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
  --conf spark.time=$time \
  --conf spark.raw.path="hdfs://hadoop000:8020/tai/access/$time" \
  --conf spark.ip.path="hdfs://hadoop000:8020/tai/access/ip.txt" \
  /home/hadoop/lib/sparksql-train-1.0.jar

Execute with sh job.sh, then verify in the Kudu web UI that the date-suffixed result tables (for example province_city_stat_20181007) were written.

4. Run on Yarn

Create jobyarn.sh by changing --master to yarn:

time=20181007
${SPARK_HOME}/bin/spark-submit \
  --class com.imooc.bigdata.cp08.SparkApp \
  --master yarn \
  --jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
  --conf spark.time=$time \
  --conf spark.raw.path="hdfs://hadoop000:8020/tai/access/$time" \
  --conf spark.ip.path="hdfs://hadoop000:8020/tai/access/ip.txt" \
  /home/hadoop/lib/sparksql-train-1.0.jar

Run with sh jobyarn.sh and monitor the job on the Yarn UI (port 8088).
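The same job can also be followed from the command line. The sketch below assumes the submit output was saved to a file and contains a YARN application ID of the usual form application_<clusterTimestamp>_<sequence>; extract_app_id and the file name submit.log are hypothetical.

```shell
# Hypothetical helper: read saved spark-submit output on stdin and print the
# first YARN application ID it contains (nothing if no ID is present).
extract_app_id() {
  grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Possible usage (left as comments because it needs a running cluster):
#   sh jobyarn.sh > submit.log 2>&1
#   app_id="$(extract_app_id < submit.log)"
#   yarn application -status "$app_id"
#   yarn logs -applicationId "$app_id"   # needs YARN log aggregation enabled
```

The output is written to a file first, rather than piped straight into the helper, so that the early exit of head cannot close the pipe while spark-submit is still writing.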

5. Schedule the job

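Use crontab (or a workflow scheduler such as Azkaban or Oozie) for daily execution. For reference, the cron expression 0 */1 * * * runs a job every hour. For a daily run at 03:00 that processes yesterday's data, change job.sh so that it computes the date itself (the --date option is GNU date syntax):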

time=`date --date='1 day ago' +%Y%m%d`
${SPARK_HOME}/bin/spark-submit \
  --class com.imooc.bigdata.cp08.SparkApp \
  --master local \
  --jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
  --conf spark.time=$time \
  --conf spark.raw.path="hdfs://hadoop000:8020/tai/access/$time" \
  --conf spark.ip.path="hdfs://hadoop000:8020/tai/access/ip.txt" \
  /home/hadoop/lib/sparksql-train-1.0.jar
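Hard-wiring "yesterday" makes it awkward to reprocess an older day after a failure. One possible variation, shown only as a sketch, lets an optional first argument override the computed date; resolve_batch_date is a hypothetical helper name.

```shell
# Hypothetical helper: use the date passed as the first argument, or fall back
# to yesterday (GNU date syntax, the same call as in the script above).
resolve_batch_date() {
  if [ -n "${1:-}" ]; then
    echo "$1"
  else
    date --date='1 day ago' +%Y%m%d
  fi
}
```

With time="$(resolve_batch_date "$1")" as the first line of job.sh, cron keeps processing yesterday's data, while sh job.sh 20181007 reruns a specific day.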

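Add the entry to crontab. Because cron invokes the script directly, job.sh must be executable (chmod +x), and because cron does not load the login environment, SPARK_HOME has to be set inside the script: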

crontab -e
0 3 * * * /home/hadoop/lib/job.sh
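Cron discards or mails job output by default, so a failed nightly run can be hard to diagnose. The wrapper below is a sketch of one way to keep a per-day log; run_logged is a hypothetical helper, and the SPARK_HOME default is an assumption based on the install path shown earlier in this article.

```shell
# Assumption: Spark lives where the earlier directory listing shows it.
SPARK_HOME="${SPARK_HOME:-/home/hadoop/app/spark-2.4.5-bin-hadoop2.6}"
export SPARK_HOME

# Hypothetical helper: run a command, append its stdout and stderr to a
# per-day log file in the given directory, and return the command's exit status.
run_logged() {
  log_dir="$1"; shift
  mkdir -p "$log_dir" || return 1
  log_file="$log_dir/job_$(date +%Y%m%d).log"
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] start: $*" >> "$log_file"
  "$@" >> "$log_file" 2>&1
  rc=$?
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] exit status: $rc" >> "$log_file"
  return "$rc"
}
```

A wrapper script scheduled from crontab could end with run_logged /home/hadoop/logs sh /home/hadoop/lib/job.sh, where the log directory is again only an example.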

After these steps the advertising data pipeline runs automatically each day, loading raw JSON data from HDFS, processing it with Spark, and storing results in Kudu.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.