Practical Guide to Building an Advertising Project with Spark and Kudu
This article provides a step‑by‑step tutorial on deploying a Spark‑based advertising data pipeline using Kudu, covering Hadoop setup, HDFS data loading, Spark application refactoring, Maven packaging, Yarn execution, and crontab scheduling for daily automated runs.
Goal: Package and run the Spark + Kudu advertising project on a server.
1. Put data on HDFS
Start Hadoop services and create a daily directory in HDFS (YYYYMMDD format), then upload the JSON data and IP rule file.
[hadoop@hadoop000 ~]$ cd app/
[hadoop@hadoop000 app]$ ls
apache-maven-3.6.3 hive-1.1.0-cdh5.15.1 spark-2.4.5-bin-hadoop2.6
hadoop-2.6.0-cdh5.15.1 jdk1.8.0_91 tmp
[hadoop@hadoop000 app]$ cd hadoop-2.6.0-cdh5.15.1/
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ ls
bin etc include LICENSE.txt README.txt src
bin-mapreduce1 examples lib logs sbin
cloudera examples-mapreduce1 libexec NOTICE.txt share
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ cd sbin/
[hadoop@hadoop000 sbin]$ ls
distribute-exclude.sh slaves.sh stop-all.sh
hadoop-daemon.sh start-all.cmd start-all.sh
stop-balancer.sh hdfs-config.cmd start-balancer.sh
stop-dfs.cmd hdfs-config.sh start-dfs.cmd
stop-dfs.sh hdfs-daemons.sh start-dfs.sh
stop-secure-dns.sh httpfs.sh start-secure-dns.sh
stop-yarn.cmd kms.sh start-yarn.cmd
stop-yarn.sh yarn-daemon.sh mr-jobhistory-daemon.sh
[hadoop@hadoop000 sbin]$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [hadoop000]
... (log output omitted for brevity) ...
Create the daily directory and upload files:
[hadoop@hadoop000 sbin]$ hadoop fs -mkdir -p /tai/access/20181007
[hadoop@hadoop000 sbin]$ hadoop fs -put ~/data/data-test.json /tai/access/20181007/
[hadoop@hadoop000 sbin]$ hadoop fs -put ~/data/ip.txt /tai/access/
Verify the upload via the HDFS web UI (port 50070).
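Because the directory name changes every day, the upload steps above can be wrapped in a small helper so the path is derived from a date argument instead of typed by hand. The following is a minimal sketch, assuming the same /tai/access layout used above; BASE_DIR, daily_dir and upload_day are hypothetical names that do not appear in the original project.

```shell
#!/bin/sh
# Sketch: derive the per-day HDFS directory used in this article.
# BASE_DIR mirrors the /tai/access layout above (assumption: same layout on your cluster).
BASE_DIR="/tai/access"

daily_dir() {
    # $1 is the batch date in YYYYMMDD form, e.g. 20181007
    printf '%s/%s\n' "$BASE_DIR" "$1"
}

upload_day() {
    # $1 = batch date, $2 = local JSON file to upload.
    # Creates the daily directory, then uploads the file into it.
    dir=$(daily_dir "$1")
    hadoop fs -mkdir -p "$dir" && hadoop fs -put "$2" "$dir/"
}

# Example (requires a running HDFS):
#   upload_day 20181007 ~/data/data-test.json
```

Only upload_day talks to HDFS; daily_dir is a pure string helper, so it can be checked without a cluster.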
2. Refactor for scheduled operation
Modify the Spark application to accept runtime parameters (spark.time, spark.raw.path, spark.ip.path) and to generate table names based on the processing date.
package com.imooc.bigdata.cp08
import com.imooc.bigdata.cp08.business.{AppStatProcessor, AreaStatProcessor, LogETLProcessor, ProvinceCityStatProcessor}
import org.apache.commons.lang3.StringUtils
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
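// Entry point for the whole project
object SparkApp extends Logging {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: a master set in code takes precedence over
    // spark-submit's --master, so hard-coding local[2] would keep the job off Yarn.
    val spark = SparkSession.builder()
      .appName("SparkApp")
      .getOrCreate()

    // spark-submit ... --conf spark.time=20181007
    // spark-submit ignores --conf keys that do not start with "spark."
    // The default "" avoids a NoSuchElementException when spark.time is missing.
    val time = spark.sparkContext.getConf.get("spark.time", "")
    if (StringUtils.isBlank(time)) {
      logError("The processing batch (spark.time) must not be empty")
      // Non-zero exit code so a scheduler can detect the failure
      System.exit(1)
    }

    LogETLProcessor.process(spark)
    ProvinceCityStatProcessor.process(spark)
    AreaStatProcessor.process(spark)
    AppStatProcessor.process(spark)

    spark.stop()
  }
}
Utility for generating table names: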
package com.imooc.bigdata.cp08.utils
import org.apache.spark.sql.SparkSession
object DateUtils {
  // Appends the processing date to a base table name, e.g. "ods" -> "ods_20181007"
  def getTableName(tableName: String, spark: SparkSession) = {
    val time = spark.sparkContext.getConf.get("spark.time")
    tableName + "_" + time
  }
}
Example usage in processors:
val sourceTableName = DateUtils.getTableName("ods",spark)
val sinkTableName = DateUtils.getTableName("province_city_stat",spark)
Define input paths:
val rawPath = spark.sparkContext.getConf.get("spark.raw.path")
var jsonDF = spark.read.json(rawPath)
val ipRulePath = spark.sparkContext.getConf.get("spark.ip.path")
val ipRowRDD = spark.sparkContext.textFile(ipRulePath)
Run the application with spark-submit, passing the three configuration parameters.
3. Package the code
Use Maven to build a JAR, then copy the JAR together with kudu-client-1.7.0.jar and kudu-spark2_2.11-1.7.0.jar to ~/lib on the server.
Start Spark services:
cd ~/app/spark-2.4.5-bin-hadoop2.6/sbin/
sh start-all.sh
Create a local test script job.sh:
time=20181007
${SPARK_HOME}/bin/spark-submit \
--class com.imooc.bigdata.cp08.SparkApp \
--master local \
--jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
--conf spark.time=$time \
--conf spark.raw.path="hdfs://hadoop000:8020/tai/access/$time" \
--conf spark.ip.path="hdfs://hadoop000:8020/tai/access/ip.txt" \
/home/hadoop/lib/sparksql-train-1.0.jar
Execute with sh job.sh and verify the results in the UI.
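The Spark application only checks that spark.time is not blank, so a mistyped date would still produce oddly named tables and a wrong input path. A small guard in the script can catch that before the job is submitted. This is a sketch, not part of the original project; is_valid_batch is a hypothetical helper that only checks for exactly eight digits and does not verify that the value is a real calendar date.

```shell
#!/bin/sh
# Sketch: reject batch values that are not exactly eight digits (YYYYMMDD).
# is_valid_batch is a hypothetical helper name.
is_valid_batch() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use near the top of job.sh:
#   is_valid_batch "$time" || { echo "bad batch date: $time" >&2; exit 1; }
```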
4. Run on Yarn
Create jobyarn.sh by changing --master to yarn:
time=20181007
${SPARK_HOME}/bin/spark-submit \
--class com.imooc.bigdata.cp08.SparkApp \
--master yarn \
--jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
--conf spark.time=$time \
--conf spark.raw.path="hdfs://hadoop000:8020/tai/access/$time" \
--conf spark.ip.path="hdfs://hadoop000:8020/tai/access/ip.txt" \
/home/hadoop/lib/sparksql-train-1.0.jar
Run with sh jobyarn.sh and monitor the job on the Yarn UI (port 8088).
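Since job.sh and jobyarn.sh differ only in the --master value, the two can be folded into one parameterized script. The sketch below assumes the same jar locations and HDFS paths as the scripts above; submit_cmd, LIB_DIR and HDFS_BASE are hypothetical names, and the SPARK_HOME fallback is taken from the directory listing in step 1. The helper only prints the command line, so it can be inspected before anything is submitted.

```shell
#!/bin/sh
# Sketch: one helper for both local and Yarn runs.
# The fallback SPARK_HOME is an assumption based on the install path shown earlier.
SPARK_HOME="${SPARK_HOME:-/home/hadoop/app/spark-2.4.5-bin-hadoop2.6}"
LIB_DIR="/home/hadoop/lib"
HDFS_BASE="hdfs://hadoop000:8020/tai/access"

submit_cmd() {
    # $1 = master (local or yarn), $2 = batch date (YYYYMMDD)
    master=$1
    batch=$2
    echo "${SPARK_HOME}/bin/spark-submit" \
        --class com.imooc.bigdata.cp08.SparkApp \
        --master "$master" \
        --jars "$LIB_DIR/kudu-client-1.7.0.jar,$LIB_DIR/kudu-spark2_2.11-1.7.0.jar" \
        --conf "spark.time=$batch" \
        --conf "spark.raw.path=$HDFS_BASE/$batch" \
        --conf "spark.ip.path=$HDFS_BASE/ip.txt" \
        "$LIB_DIR/sparksql-train-1.0.jar"
}

# Print the command for a Yarn run of the 20181007 batch:
#   submit_cmd yarn 20181007
# Run it (safe here because none of the values contain spaces):
#   eval "$(submit_cmd yarn 20181007)"
```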
5. Schedule the job
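Use crontab (or a workflow scheduler such as Azkaban or Oozie) to run the job every day. For reference, the cron expression 0 */1 * * * runs a job every hour. For a daily run at 03:00 that processes yesterday's data, change job.sh so that it computes the batch date itself: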
time=`date --date='1 day ago' +%Y%m%d`
${SPARK_HOME}/bin/spark-submit \
--class com.imooc.bigdata.cp08.SparkApp \
--master local \
--jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
--conf spark.time=$time \
--conf spark.raw.path="hdfs://hadoop000:8020/tai/access/$time" \
--conf spark.ip.path="hdfs://hadoop000:8020/tai/access/ip.txt" \
/home/hadoop/lib/sparksql-train-1.0.jar
Add the entry to crontab:
crontab -e
0 3 * * * /home/hadoop/lib/job.sh
After these steps, the advertising data pipeline runs automatically each day: it loads raw JSON data from HDFS, processes it with Spark, and stores the results in Kudu.
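One caveat for the scheduled run: cron starts jobs with a minimal environment, so ${SPARK_HOME} may be unset, and the crontab entry as written requires job.sh to be executable (or to be invoked through sh). The following is a sketch of the lines that could go at the top of job.sh to make it cron-friendly. The fallback SPARK_HOME comes from the directory listing in step 1; LOG_DIR, yesterday and log_file_for are hypothetical names, and yesterday relies on GNU date, like the original script.

```shell
#!/bin/sh
# Sketch: cron-friendly preamble for job.sh.
# cron does not load the login shell profile, so set SPARK_HOME explicitly.
SPARK_HOME="${SPARK_HOME:-/home/hadoop/app/spark-2.4.5-bin-hadoop2.6}"
export SPARK_HOME

# Hypothetical log location; adjust to a directory that exists on your server.
LOG_DIR="${LOG_DIR:-/home/hadoop/logs}"

yesterday() {
    # Batch date for the previous day in YYYYMMDD form (GNU date syntax)
    date --date='1 day ago' +%Y%m%d
}

log_file_for() {
    # $1 = batch date; one log file per batch makes failed days easy to find
    printf '%s/job-%s.log\n' "$LOG_DIR" "$1"
}

# Possible crontab entry that keeps the output of each run:
#   0 3 * * * sh /home/hadoop/lib/job.sh >> /home/hadoop/logs/cron.log 2>&1
```

Writing one log file per batch date is a design choice rather than a requirement; a single appended log works as well if it is rotated.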
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
