Spark and MongoDB Tutorial: Daily Active User Statistics with Scala

This tutorial guides readers through using Apache Spark and MongoDB to compute daily active user statistics, covering Spark fundamentals, a Spark‑vs‑Hadoop comparison, MongoDB use cases, environment setup, Scala code workflow, Maven compilation, and job submission on a YARN cluster.

360 Quality & Efficiency

Jun 6, 2016

Apache Spark has become one of the most popular big‑data processing engines. This article demonstrates a hands‑on practice for counting daily active users, helping readers quickly start with Spark and MongoDB, and provides source code for download.

Spark

Spark is a fast, general-purpose engine for large-scale data processing. According to the Spark project, it can run programs up to 100× faster than Hadoop MapReduce in memory and 10× faster on disk, thanks to a DAG execution engine that supports iterative processing and in-memory computation.

1. Introduction and Advantages of Spark

Spark provides a distributed memory abstraction called the RDD (Resilient Distributed Dataset), an immutable, partitioned collection of records. RDDs support two types of operations: transformations (lazy; each returns a new RDD) and actions (which trigger computation and return results to the driver or write them to storage).

All transformations are lazily evaluated: they only record a directed acyclic graph (DAG) of RDD dependencies. When an action is called, the DAG is submitted as a job. Spark's scheduler splits the job into stages at shuffle boundaries, pipelines the transformations inside each stage, runs one task per partition, and reuses cached RDDs where they exist.
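The lazy behaviour described above can be seen in a minimal sketch. It assumes Spark is on the classpath and runs in local mode; the object name `LazyEvaluationSketch` and the sample numbers are illustrative and do not come from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Builds a small lineage and returns the sum of the squares of the even numbers.
  def sumOfEvenSquares(sc: SparkContext): Int = {
    val numbers = sc.parallelize(1 to 10)

    // Transformations: each call only records a step in the DAG, nothing runs yet.
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // Action: submits the job, which executes filter and map in one pipelined stage.
    squares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for trying this out on one machine.
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc   = new SparkContext(conf)
    try {
      println(s"sum of even squares = ${sumOfEvenSquares(sc)}") // prints 220
    } finally {
      sc.stop()
    }
  }
}
```

Because `filter` and `map` involve no shuffle, Spark can run them together as one stage; only `reduce` causes any work to happen.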

Spark also offers two fault‑tolerance mechanisms: lineage (recomputing lost partitions) and checkpointing (persisting data to stable storage). It excels at iterative workloads because intermediate data can stay in memory.
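Both mechanisms are visible through the RDD API. The following is a sketch only: the checkpoint directory `/tmp/spark-checkpoints` and the sample words are assumptions made for illustration, and on a real cluster the directory would normally live on HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FaultToleranceSketch {
  // A small lineage: parallelize -> map -> reduceByKey (the last step needs a shuffle).
  def buildCounts(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fault-tolerance-sketch").setMaster("local[2]")
    val sc   = new SparkContext(conf)
    try {
      sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

      val counts = buildCounts(sc)
      counts.cache()      // keep the result in memory for later reuse
      counts.checkpoint() // must be requested before the first action on this RDD

      // Lineage: the chain Spark would replay to rebuild a lost partition.
      println(counts.toDebugString)

      counts.collect().foreach(println) // the action runs the job and writes the checkpoint
      println(s"checkpointed: ${counts.isCheckpointed}")
    } finally {
      sc.stop()
    }
  }
}
```

Lineage costs nothing until a partition is lost, while checkpointing pays a write up front to cut a long dependency chain, which is why it suits long iterative jobs.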

Beyond performance, Spark provides a unified platform for batch (Spark Core), interactive (Spark SQL), streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX), which is a major advantage over Hadoop.

2. Spark vs. Hadoop

Hadoop solves reliable storage and processing of massive data sets, using HDFS for fault-tolerant file storage and MapReduce as a simple programming model. MapReduce has clear limitations, however: its API is low-level and offers only map and reduce operations, every job carries high startup overhead and latency, there is no interactive mode, and iterative algorithms perform poorly because each pass writes its intermediate results back to disk.

MongoDB

MongoDB is a NoSQL document database that offers high write throughput, easy horizontal scaling, and a flexible schema, making it suitable for large-scale, unstructured, or location-based data. The 2.6.x release used in this tutorial has no multi-document transactions and no joins, so it is a good fit only when those features are not required.

Typical scenarios for MongoDB include high‑write workloads, high availability via replica sets, easy sharding for large data volumes, geospatial queries, rapid schema evolution, and environments without dedicated DBAs.

Daily Active User Statistics Example

1. Environment Preparation

| Tool | Version | URL |
| --- | --- | --- |
| Spark | 1.4.1 | http://spark.apache.org/docs/1.4.1/ |
| MongoDB | 2.6.x | https://www.mongodb.com/download-center |
| MongoDB Hadoop Connector | 1.5.1 | https://github.com/mongodb/mongo-hadoop/releases |
| MongoDB Java Driver | 3.2.2 | http://mongodb.github.io/mongo-java-driver/ |
| Scala IDE | 4.3.0 | http://scala-ide.org/ |

2. Spark Language Choice

Spark supports Scala, Java, Python, and R. This tutorial uses Scala because it is multi‑paradigm, integrates with Java libraries, and offers concise syntax.

3. Code Flow

Create a Configuration object to set MongoDB‑Hadoop connector parameters.

Instantiate a SparkContext and use newAPIHadoopRDD with the appropriate InputFormat to obtain an RDD.

Apply transformations and actions to the RDD, then persist the results with saveAsNewAPIHadoopFile.
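The three steps above can be sketched as follows. This is not the article's own code, which was published as images. The database `demo`, the collections `events` and `event_counts`, the field name `event`, and the localhost URIs are all hypothetical. The sketch assumes the MongoDB Hadoop Connector and the MongoDB Java Driver are on the classpath and that a MongoDB instance is reachable.

```scala
import com.mongodb.hadoop.{MongoInputFormat, MongoOutputFormat}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.bson.{BSONObject, BasicBSONObject}

object MongoFlowSketch {
  // Step 1: a Hadoop Configuration carrying one connector setting.
  def mongoConfig(key: String, uri: String): Configuration = {
    val conf = new Configuration()
    conf.set(key, uri)
    conf
  }

  def main(args: Array[String]): Unit = {
    // No master is set here; it is expected to come from the submit command.
    val sc = new SparkContext(new SparkConf().setAppName("mongo-flow-sketch"))
    try {
      val in  = mongoConfig("mongo.input.uri", "mongodb://localhost:27017/demo.events")
      val out = mongoConfig("mongo.output.uri", "mongodb://localhost:27017/demo.event_counts")

      // Step 2: each MongoDB document becomes an (id, BSONObject) pair.
      val docs: RDD[(Object, BSONObject)] =
        sc.newAPIHadoopRDD(in, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])

      // Step 3a: transformations that count documents per value of the "event" field.
      val counts = docs
        .map { case (_, doc) => (String.valueOf(doc.get("event")), 1L) }
        .reduceByKey(_ + _)

      // Step 3b: wrap each result in a BSON document and write it back.
      val results: RDD[(Object, BSONObject)] = counts.map { case (event, n) =>
        val bson = new BasicBSONObject()
        bson.put("event", event)
        bson.put("count", Long.box(n))
        (null: Object, bson: BSONObject) // null key: an _id is generated on insert
      }
      // The path argument is required by the API but ignored by MongoOutputFormat.
      results.saveAsNewAPIHadoopFile("file:///unused", classOf[Object],
        classOf[BSONObject], classOf[MongoOutputFormat[Object, BSONObject]], out)
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the input and output settings in separate Configuration objects makes it harder to write results into the source collection by accident.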

4. Scala Code for Daily Active User Counting

Constants.scala (image)

ActiveUserOnDay.scala (image)

Additional supporting images are included in the original article.
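Since the original listings are only available as images, here is a minimal sketch of how the counting step could look. It is an assumption-laden illustration, not the authors' ActiveUserOnDay.scala: the input is taken to be (userId, epoch-millisecond timestamp) pairs, days are cut in UTC, and the sample data is invented and held in memory so the logic can be tried without MongoDB.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActiveUserSketch {
  // Formats an epoch-millisecond timestamp as a UTC calendar day, e.g. "1970-01-01".
  def dayOf(epochMillis: Long): String = {
    // A new formatter per call, because SimpleDateFormat is not thread-safe.
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.format(new Date(epochMillis))
  }

  // Counts distinct users per day from (userId, timestamp) events.
  def dailyActiveUsers(events: RDD[(String, Long)]): RDD[(String, Long)] =
    events
      .map { case (userId, ts) => (dayOf(ts), userId) }
      .distinct()                         // one record per (day, user)
      .map { case (day, _) => (day, 1L) }
      .reduceByKey(_ + _)                 // users per day

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("active-user-sketch").setMaster("local[2]")
    val sc   = new SparkContext(conf)
    try {
      val events = sc.parallelize(Seq(
        ("u1", 0L), ("u1", 3600000L), ("u2", 7200000L), // three events on 1970-01-01
        ("u1", 86400000L)))                             // one event on 1970-01-02
      dailyActiveUsers(events).collect().sortBy(_._1).foreach(println)
      // prints (1970-01-01,2) then (1970-01-02,1)
    } finally {
      sc.stop()
    }
  }
}
```

Calling `distinct()` before counting is what makes the result "active users" rather than "events": a user who appears many times in one day contributes exactly one record for that day.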

5. Maven Compilation of Scala Code

Using Scala IDE, create a new project, add the Scala files, and define a POM file (shown as images in the original article). Then build the JAR with Maven: mvn clean package -P scala-2.10

6. Submitting the Spark Job

Upload the compiled JAR to the Spark client machine and run the submission shell command (shown as an image in the original article).

The job runs on a YARN cluster; after a short wait the output contains the daily active user statistics.

References

Scala basic syntax: http://www.yiibai.com/scala/scala_basic_syntax.html

Spark documentation: http://spark.apache.org/docs/1.4.1/

Source code download: https://yunpan.cn/cRgp8LNGX6BWC (extraction code: ffc9)
