Spark and MongoDB Tutorial: Daily Active User Statistics with Scala
This tutorial guides readers through using Apache Spark and MongoDB to compute daily active user statistics, covering Spark fundamentals, a Spark‑vs‑Hadoop comparison, MongoDB use cases, environment setup, Scala code workflow, Maven compilation, and job submission on a YARN cluster.
Apache Spark has become one of the most popular big‑data processing engines. This article walks through a hands‑on example, counting daily active users, to help readers get started with Spark and MongoDB quickly. The source code is available for download.
Spark
The Spark project describes it as a fast, general‑purpose engine for large‑scale data processing. According to the project, programs can run up to 100× faster than Hadoop MapReduce in memory and up to 10× faster on disk, thanks to an advanced DAG execution engine that supports iterative processing and in‑memory computation.
1. Introduction and Advantages of Spark
Spark provides a distributed memory abstraction called the RDD (Resilient Distributed Dataset), an immutable, partitioned collection of records. RDDs support two types of operations: transformations (lazy; each one returns a new RDD) and actions (trigger computation and return a result to the driver or write it to storage).
All transformations are lazily evaluated: they only build a directed acyclic graph (DAG) of RDD dependencies. When an action is called, the DAG is submitted as a job. The scheduler splits the job into stages at shuffle boundaries, pipelines the transformations inside each stage, and runs one task per partition, reusing cached partitions where they exist.
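The difference between lazy transformations and an action can be sketched as follows. This is a minimal illustration, not code from the original article: the object name, the sample sentences, and the `local[2]` master (Spark running inside a single JVM) are assumptions chosen so the example needs no cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Counts the distinct words longer than three characters.
  def countLongWords(sc: SparkContext, lines: Seq[String]): Long = {
    // Transformations: each call returns a new RDD right away; no data is processed yet.
    val words = sc.parallelize(lines).flatMap(_.split(" "))
    val longWords = words.filter(_.length > 3)
    val uniqueLongWords = longWords.distinct()

    // Action: count() submits the DAG built above as a job and returns a value to the driver.
    uniqueLongWords.count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, for illustration only.
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = Seq("spark builds a dag", "actions trigger the dag")
      println(countLongWords(sc, lines)) // spark, builds, actions, trigger
    } finally {
      sc.stop()
    }
  }
}
```

Because `distinct()` requires a shuffle, the job in this sketch would be split into two stages, while `flatMap` and `filter` are pipelined inside the first one.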
Spark also offers two fault‑tolerance mechanisms: lineage (recomputing lost partitions) and checkpointing (persisting data to stable storage). It excels at iterative workloads because intermediate data can stay in memory.
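A short sketch of how the two mechanisms appear in the RDD API, assuming local mode and a temporary local directory as the checkpoint location (on a real cluster this would normally be a reliable store such as HDFS). The object and method names are illustrative and do not come from the original article.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FaultToleranceSketch {
  def checkpointedSquares(sc: SparkContext, checkpointDir: String): RDD[Int] = {
    sc.setCheckpointDir(checkpointDir)
    val squares = sc.parallelize(1 to 10).map(n => n * n)
    squares.cache()      // keep computed partitions in memory for reuse
    squares.checkpoint() // only marks the RDD; data is written when an action runs
    squares
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, for illustration only.
    val conf = new SparkConf().setAppName("fault-tolerance-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val dir = Files.createTempDirectory("spark-checkpoint").toString
      val squares = checkpointedSquares(sc, dir)

      // Lineage: the chain of dependencies Spark would replay to rebuild a lost partition.
      println(squares.toDebugString)

      // The first action computes the RDD and then writes the checkpoint.
      println(squares.reduce(_ + _))
      println(squares.isCheckpointed)
    } finally {
      sc.stop()
    }
  }
}
```

Calling `cache()` before `checkpoint()` is a common choice because it lets Spark write the checkpoint from the cached partitions instead of recomputing them.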
Beyond performance, Spark provides a unified platform for batch (Spark Core), interactive (Spark SQL), streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX), which is a major advantage over Hadoop.
2. Spark vs. Hadoop
Hadoop solves reliable storage and processing of massive data sets using HDFS for fault‑tolerant file storage and MapReduce for a simple programming model. However, MapReduce suffers from high latency, limited expressiveness, and poor performance for iterative algorithms.
Key limitations of Hadoop MapReduce include a low‑level API, a model restricted to Map and Reduce operations, high per‑job overhead, no interactive mode, and poor support for iterative processing, because every job writes its intermediate results back to disk.
MongoDB
MongoDB is a NoSQL database that offers high write throughput, easy horizontal scaling, and a flexible schema, making it suitable for large‑scale, unstructured, or location‑based data. The 2.6.x release used in this tutorial has no multi‑document transactions and no joins, so it is a good fit only when those features are not required.
Typical scenarios for MongoDB include high‑write workloads, high availability via replica sets, easy sharding for large data volumes, geospatial queries, rapid schema evolution, and environments without dedicated DBAs.
Daily Active User Statistics Example
1. Environment Preparation
| Tool | Version | URL |
| --- | --- | --- |
| Spark | 1.4.1 | http://spark.apache.org/docs/1.4.1/ |
| MongoDB | 2.6.x | https://www.mongodb.com/download-center |
| MongoDB Hadoop Connector | 1.5.1 | https://github.com/mongodb/mongo-hadoop/releases |
| MongoDB Java Driver | 3.2.2 | http://mongodb.github.io/mongo-java-driver/ |
| Scala IDE | 4.3.0 | http://scala-ide.org/ |
2. Spark Language Choice
Spark supports Scala, Java, Python, and R. This tutorial uses Scala because it is multi‑paradigm, integrates with Java libraries, and offers concise syntax.
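To make these points concrete, here is a small sketch in plain Scala (no Spark) that calls a Java library class directly and uses functional collection operations to count distinct users per day. The `LoginEvent` shape, the object name, and the choice of UTC for day boundaries are assumptions made for illustration; the original article does not describe its data model.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object DayBucketSketch {
  // Hypothetical event shape: who logged in, and when (epoch milliseconds).
  case class LoginEvent(userId: String, timestampMillis: Long)

  // Java interop: SimpleDateFormat is a plain Java class used from Scala.
  // A new formatter is created per call because SimpleDateFormat is not thread-safe.
  def dayOf(timestampMillis: Long): String = {
    val format = new SimpleDateFormat("yyyy-MM-dd")
    format.setTimeZone(TimeZone.getTimeZone("UTC")) // assumption: days are cut at UTC midnight
    format.format(new Date(timestampMillis))
  }

  // Functional style: group events by day, then count distinct user IDs in each group.
  def dailyActiveUsers(events: Seq[LoginEvent]): Map[String, Int] =
    events
      .groupBy(event => dayOf(event.timestampMillis))
      .map { case (day, dayEvents) => day -> dayEvents.map(_.userId).distinct.size }
}
```

For example, two logins by `u1` and one by `u2` on the same UTC day would yield a count of 2 for that day. The same group-and-deduplicate idea carries over to RDD operations when the data no longer fits on one machine.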
3. Code Flow
Create a Configuration object to set MongoDB‑Hadoop connector parameters.
Instantiate a SparkContext and use newAPIHadoopRDD with the appropriate InputFormat to obtain an RDD.
Apply transformations and actions to the RDD, then persist the results with saveAsNewAPIHadoopFile.
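The three steps above can be sketched as follows. This is not the article's original listing, which is available only as images. The connection URIs, the collection names (`login_events`, `daily_active_users`), and the document fields (`day`, `userId`, `activeUsers`) are hypothetical, and the sketch assumes every input document contains both `day` and `userId`. The `mongo.input.uri` and `mongo.output.uri` keys and the `MongoInputFormat` and `MongoOutputFormat` classes come from the MongoDB Hadoop Connector.

```scala
import com.mongodb.hadoop.{MongoInputFormat, MongoOutputFormat}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.bson.{BSONObject, BasicBSONObject}

object MongoFlowSketch {
  // Builds one result document per day (field names are hypothetical).
  def toOutputDoc(day: String, activeUsers: Long): BSONObject = {
    val doc = new BasicBSONObject()
    doc.put("day", day)
    doc.put("activeUsers", java.lang.Long.valueOf(activeUsers))
    doc
  }

  def main(args: Array[String]): Unit = {
    // The master is left unset so that spark-submit can supply it.
    val sc = new SparkContext(new SparkConf().setAppName("mongo-flow-sketch"))
    try {
      // Step 1: connector parameters (placeholder URIs).
      val inputConf = new Configuration()
      inputConf.set("mongo.input.uri", "mongodb://localhost:27017/app.login_events")
      val outputConf = new Configuration()
      outputConf.set("mongo.output.uri", "mongodb://localhost:27017/app.daily_active_users")

      // Step 2: read the collection as an RDD of (_id, document) pairs.
      val docs: RDD[(Object, BSONObject)] = sc.newAPIHadoopRDD(
        inputConf, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])

      // Step 3: keep one (day, userId) pair per user and day, then count users per day.
      val dailyActive: RDD[(String, Long)] = docs
        .map { case (_, doc) => (doc.get("day").toString, doc.get("userId").toString) }
        .distinct()
        .map { case (day, _) => (day, 1L) }
        .reduceByKey(_ + _)

      // A null key leaves the generation of _id to MongoDB.
      val output: RDD[(Object, BSONObject)] =
        dailyActive.map { case (day, count) => (null, toOutputDoc(day, count)) }

      // The path argument is required by the API; the connector writes to mongo.output.uri instead.
      output.saveAsNewAPIHadoopFile(
        "file:///unused",
        classOf[Object],
        classOf[BSONObject],
        classOf[MongoOutputFormat[Object, BSONObject]],
        outputConf)
    } finally {
      sc.stop()
    }
  }
}
```

Here `saveAsNewAPIHadoopFile` plays the role of the action: nothing is read from MongoDB until that call triggers the job.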
4. Scala Code for Daily Active User Counting
The original article presents two source files, Constants.scala and ActiveUserOnDay.scala, as images, together with additional supporting images. They are not reproduced here.
5. Maven Compilation of Scala Code
Using Scala IDE, create a new project, add the Scala files, and define a POM file (shown as images in the original article). Then build with Maven: mvn clean package -P scala-2.10
6. Submitting the Spark Job
Upload the compiled JAR to the Spark client machine and submit it with a shell command (shown only as an image in the original article).
The job runs on a YARN cluster; after a short wait, the output contains the daily active user statistics.
References
Scala basic syntax: http://www.yiibai.com/scala/scala_basic_syntax.html
Spark documentation: http://spark.apache.org/docs/1.4.1/
Source code download: https://yunpan.cn/cRgp8LNGX6BWC (extraction code: ffc9)
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
