Spark Usage in DataMagic Platform: A Practical Guide
This guide explains how DataMagic uses Spark on YARN for fast, scalable offline analytics. It covers Spark's role in the platform, four steps to mastering Spark (terminology, key configurations, parallelism, and code modification), and practical topics: deployment scripts, dynamic resource tuning, MongoDB export, job troubleshooting, and cluster upkeep for trillion-record workloads.
Spark, as a big data computing engine, has rapidly come to dominate the field thanks to its speed, stability, and ease of use. This article presents the author's understanding of Spark, gained while building a computing platform, with the aim of offering readers useful learning insights. It covers Spark's role in the DataMagic platform, how to get up to speed with Spark quickly, and how DataMagic uses Spark effectively.
Spark's Role in the Platform
The main functions of the architecture are log ingestion, query (real-time and offline), and computation. The offline computing platform handles the computation, using COS (internal company storage) in place of HDFS. The architecture centers on Spark on YARN; its running process is illustrated in the architecture diagrams.
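As a rough illustration of that running process, a minimal PySpark job can be pointed at YARN simply by setting the master. The sketch below is illustrative only; the application name is hypothetical, and a correctly configured HADOOP_CONF_DIR is assumed.
from pyspark import SparkConf, SparkContext

# Minimal sketch: run a trivial job on YARN in client mode.
# Assumes HADOOP_CONF_DIR points at the cluster configuration.
conf = SparkConf().setAppName("datamagic-offline-demo").setMaster("yarn")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100), 4).sum())  # 4950, computed in YARN containers
sc.stop()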
How to Quickly Master Spark
Mastering Spark can be achieved through four key steps:
1. Understanding Spark Terminology : Learn key terms through the architecture diagram. Structural terms include Shuffle, Partitions, MapReduce, Driver, Application Master, Container, Resource Manager, and Node Manager. API programming terms include RDD and DataFrame. Understanding these terms provides insight into Spark's running principles and programming capabilities.
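To make the API-side terms concrete, here is a small sketch (illustrative only, run with a local master purely for demonstration) that builds both an RDD and a DataFrame; the groupByKey step is what triggers a Shuffle across Partitions.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "terminology-demo")  # local master, for illustration only
sqlContext = SQLContext(sc)
# RDD: a low-level distributed collection, split here into 4 Partitions
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)
print(rdd.groupByKey().mapValues(sum).collect())  # groupByKey triggers a Shuffle
# DataFrame: the schema-aware API used by Spark SQL
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()
sc.stop()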
2. Mastering Key Configurations : Spark reads configuration from spark-defaults.conf during execution. Key configurations include spark.yarn.executor.memoryOverhead, spark.executor.memory, and spark.network.timeout. For example, spark.speculation enables speculative execution—when one worker is slow, Spark launches another to execute the same task, using whichever finishes first. However, this can cause data duplication in MySQL export scenarios.
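These settings can live in spark-defaults.conf or be set programmatically; the values below are illustrative only and need tuning per workload.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "4g")                 # heap size per executor
        .set("spark.yarn.executor.memoryOverhead", "1024")  # extra off-heap headroom, in MB
        .set("spark.network.timeout", "300s")               # relax timeouts on busy clusters
        .set("spark.speculation", "false"))                 # keep off for MySQL export jobs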
3. Effectively Using Spark Parallelism : Spark's speed comes from parallelism. For RDD, improve parallelism by configuring num-executors, executor-cores, and spark.default.parallelism (typically 2-3 times the product of num-executors and executor-cores). For Spark SQL, configure spark.sql.shuffle.partitions, num-executors, and executor-cores.
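For example, with a hypothetical sizing of 10 executors and 4 cores each, the rule of thumb above gives a parallelism of roughly 80 to 120; when set in code rather than on the spark-submit command line, num-executors and executor-cores correspond to spark.executor.instances and spark.executor.cores.
from pyspark import SparkConf

num_executors, executor_cores = 10, 4             # hypothetical sizing
parallelism = 3 * num_executors * executor_cores  # upper end of the 2-3x rule of thumb
conf = (SparkConf()
        .set("spark.executor.instances", str(num_executors))
        .set("spark.executor.cores", str(executor_cores))
        .set("spark.default.parallelism", str(parallelism))      # for RDD jobs
        .set("spark.sql.shuffle.partitions", str(parallelism)))  # for Spark SQL jobs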
4. Learning to Modify Spark Code : Spark has a modular structure; the source tree shows where the SQL, GraphX, and other modules live, and the runtime environment is made up of the corresponding JAR files. To modify a specific feature, locate the source code behind the corresponding JAR, change it, recompile, and replace the JAR in the deployment.
Spark in DataMagic Platform
1. Fast Deployment : Computing tasks and data volumes change daily, so fast deployment is essential. One-click deployment scripts can bring a physical machine with 128 GB of memory and 48 cores online immediately, and Docker is also used to supply computing resources.
2. Smart Configuration Optimization : Most of Spark's behavior can be adjusted through configuration alone. For example, to let Spark adjust the executor count automatically (dynamic allocation), add the following to the NodeManager's yarn-site.xml:
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle,spark_shuffle</value></property>
<property><name>yarn.nodemanager.aux-services.spark_shuffle.class</name><value>org.apache.spark.network.yarn.YarnShuffleService</value></property>
Copy spark-2.2.0-yarn-shuffle.jar into hadoop-yarn/lib, and add the following to spark-defaults.conf:
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 100
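Per the Spark documentation, the min/max bounds above only take effect when dynamic allocation and the external shuffle service are themselves enabled, so the following two properties are normally set alongside them:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true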
3. Reasonable Resource Allocation : Different data volumes call for different resources. Jobs over millions to billions of records typically run well with around 20 cores, while jobs over hundreds of billions of records need correspondingly more resources.
4. Meeting Business Requirements : To meet high-concurrency, low-latency query requirements, the platform uses Spark to export result data to Cmongo:
# Assumes conf, dbparameter, file_name, tempTable, sparksql and pg_table_name are supplied by the calling job.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Parse "key=value" pairs into a connection dictionary
database = dict(l.split('=') for l in dbparameter.split())
parquetFile = sqlContext.read.parquet(file_name)
parquetFile.registerTempTable(tempTable)
result = sqlContext.sql(sparksql)
url = "mongodb://" + database['user'] + ":" + database['password'] + "@" + database['host'] + ":" + database['port']
# Export the result set to Cmongo (MongoDB) via the mongo-spark connector
result.write.format("com.mongodb.spark.sql").mode('overwrite').options(uri=url, database=database['dbname'], collection=pg_table_name).save()
5. Job Problem Diagnosis : When a job fails, use yarn logs -applicationId <applicationId> to aggregate the container logs and locate the cause of the failure. Common causes include SQL syntax errors, differences in NULL handling across Spark versions, data skew that leaves a few tasks running for a very long time, and memory overflow.
6. Cluster Management : Regularly check for lost/unhealthy nodes, clean expired HDFS logs, and ensure cluster resources meet computing needs.
Summary
The platform has been in use for offline analysis, processing data at the trillion-record level daily.
