Spark Introduction and Integration with MongoDB: Architecture, Use Cases, and Code Samples

This article introduces Apache Spark as a fast, general‑purpose big‑data engine, explains its ecosystem, compares HDFS with MongoDB, and demonstrates how Spark can be combined with MongoDB through the Mongo‑Spark connector, including real‑world case studies and sample code.

Architecture Digest

Sep 17, 2016

According to the official definition, Apache Spark is a general‑purpose, fast engine for large‑scale data processing.

Its versatility comes from Spark SQL for analytics, Spark Streaming for real‑time data, MLlib for machine learning, and support for Java, Python, Scala, and R.

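Speed comes from in‑memory computation: intermediate results stay in memory instead of being written to disk between stages, so iterative algorithms can run up to 100× faster than on traditional MapReduce, according to the project's own benchmarks.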
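To make the in‑memory point concrete, here is a minimal sketch of an iterative job that caches its input once and reuses it on every pass. The data set (the numbers 1 to 1,000,000), the learning rate, and the object name are all made up for illustration, and the job runs in local mode rather than on a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input, materialized in memory on first use.
      val values = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

      // One-dimensional gradient descent toward the mean of the values.
      // Every iteration makes a full pass over the cached partitions.
      var center = 0.0
      val learningRate = 0.25
      for (_ <- 1 to 20) {
        val current = center // copy to a val so the closure captures a constant
        val gradient = values.map(x => 2 * (current - x)).mean()
        center -= learningRate * gradient
      }
      println(center)
    } finally {
      sc.stop()
    }
  }
}
```

Without the `cache()` call, each of the 20 passes would rebuild the input from scratch. With it, only the first pass pays that cost.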

Spark scales horizontally: it processes data stored on distributed storage such as HDFS, and compute nodes can be added elastically as workloads grow.

Typical use cases range from simple page‑click counting to complex machine‑learning‑driven personalization, such as Yahoo’s news recommendation, Comcast’s program recommendation, Uber’s real‑time order analysis, and Youku’s BI upgrades.

In the Hadoop ecosystem Spark sits alongside HDFS for storage and YARN/Mesos for resource management, or can run in standalone mode.

Compared with HDFS, MongoDB offers document‑level storage, secondary indexes, fast CRUD operations, and millisecond‑level response times.

For example, a log‑analysis task that would require full scans on HDFS can be answered in seconds on MongoDB using indexed queries.
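As a rough illustration of the indexed‑query case, the sketch below uses the MongoDB Java driver (3.x) from Scala. The database name `logs`, the collection name `access`, and the `errorCode` field are assumptions chosen for the example, and a MongoDB instance is assumed to be listening on 127.0.0.1.

```scala
import com.mongodb.MongoClient
import org.bson.Document

object IndexedLogQuery {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient("127.0.0.1", 27017)
    try {
      // Hypothetical database and collection names.
      val logs = client.getDatabase("logs").getCollection("access")

      // Secondary index on errorCode (ascending).
      logs.createIndex(new Document("errorCode", 1))

      // Equality match on the indexed field, limited to ten documents.
      val cursor = logs.find(new Document("errorCode", 404)).limit(10).iterator()
      try {
        while (cursor.hasNext) println(cursor.next().toJson())
      } finally {
        cursor.close()
      }
    } finally {
      client.close()
    }
  }
}
```

With the index in place, MongoDB can answer the query from the index instead of scanning every document, which is what a scan over files on HDFS would have to do.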

Spark + MongoDB Architecture

Spark drivers submit jobs to a Spark master, which schedules work across multiple worker nodes. Each executor reads raw data from MongoDB, applies Spark transformations, and writes results back to MongoDB via the Mongo‑Spark connector.

The connector supports predicate push‑down, so filters like "errorCode=404" are executed on MongoDB, dramatically reducing data transferred to Spark.

Spark and MongoDB can also be co‑located on the same node to cut network latency, provided their resources are isolated (e.g., via cgroups).

Success Cases

Air France uses Spark to classify customer data stored in MongoDB for a 360° view. Stratio built a real‑time monitoring platform for a multinational bank using Apache Flume, Spark, and MongoDB. Eastern Airlines replaced a real‑time pricing engine with a Spark‑MongoDB batch pipeline, achieving tens of thousands of writes per second and sub‑10 ms query latency.

Eastern Airlines Solution

Historical fare data (billions of records) are pre‑computed daily by Spark, stored in MongoDB, and served via fast key‑value lookups, reducing average response time from hundreds of milliseconds to about 10 ms.

Spark Job Entry Example

# curl -OL http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
# mkdir -p ~/spark
# tar -xvf spark-1.6.0-bin-hadoop2.6.tgz -C ~/spark --strip-components=1

Testing the connector:

# cd ~/spark
# ./bin/spark-shell \
    --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/flights.av" \
    --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/flights.output" \
    --packages org.mongodb.spark:mongo-spark-connector_2.10:1.0.0

Once the shell has started, run the following at the scala> prompt:

import com.mongodb.spark._
import org.bson.Document

MongoSpark.load(sc).take(10).foreach(println)

Simple group‑by statistics:

MongoSpark.load(sc)
  .map(doc => (doc.getString("flight"), doc.getLong("seats")))
  .reduceByKey(_ + _)
  .take(10)
  .foreach(println)

Group‑by with a filter (e.g., only flights originating from KMG):

import org.bson.Document

MongoSpark.load(sc)
  .withPipeline(Seq(Document.parse("{ $match: { orig : 'KMG' } }")))
  .map(doc => (doc.getString("flight"), doc.getLong("seats")))
  .reduceByKey(_ + _)
  .take(10)
  .foreach(println)
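The architecture section also describes writing results back to MongoDB and serving them through key lookups. A hedged sketch of that round trip, continuing the same spark-shell session, might look as follows. It assumes the `flights.output` collection from the output URI is empty, that `seats` is stored as a 64‑bit integer, and that the flight number `MU5101` exists; that flight number is invented for the example.

```scala
import com.mongodb.MongoClient
import com.mongodb.spark._
import org.bson.Document

// Pre-compute total seats per flight and key each result document by flight number.
val totals = MongoSpark.load(sc)
  .map(doc => (doc.getString("flight"), doc.getLong("seats").longValue))
  .reduceByKey(_ + _)
  .map { case (flight, seats) => new Document("_id", flight).append("seats", seats) }

// Write to the collection named by spark.mongodb.output.uri (flights.output).
MongoSpark.save(totals)

// Serving side: a single-document lookup on _id, which MongoDB always indexes.
val client = new MongoClient("127.0.0.1")
val hit = client.getDatabase("flights").getCollection("output")
  .find(new Document("_id", "MU5101")) // made-up flight number
  .first()
println(hit) // prints null if no such flight was written
client.close()
```

Because the batch job does the aggregation ahead of time, the serving query touches exactly one document. Re-running the job against a non-empty output collection would likely fail on duplicate `_id` values, so a real pipeline would need a replace or upsert strategy.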

Performance Optimization Tips

Choose the chunk size (in MB) deliberately: the number of RDD partitions equals the total data size divided by the chunk size, so a smaller chunk size yields more partitions and more parallelism.

Reserve 1–2 CPU cores for the OS and other processes; don’t allocate all cores to Spark.

Consider co‑locating Spark and MongoDB on the same machine to reduce I/O latency.
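The first two tips come down to simple arithmetic, sketched below in plain Scala. The function names, the 10 GB data size, the 64 MB chunk size, and the 16‑core machine are illustrative assumptions, not values from the connector or from any of the case studies.

```scala
object SizingSketch {
  // Partitions produced when totalMb of data is split into chunks of chunkMb (rounded up).
  def partitionCount(totalMb: Long, chunkMb: Long): Long = {
    require(totalMb >= 0 && chunkMb > 0, "total must be non-negative and chunk size positive")
    (totalMb + chunkMb - 1) / chunkMb
  }

  // Cores left for Spark after reserving some for the OS and other processes, never below 1.
  def sparkCores(machineCores: Int, reserved: Int = 2): Int =
    math.max(1, machineCores - reserved)

  def main(args: Array[String]): Unit = {
    println(partitionCount(10240L, 64L)) // 10 GB in 64 MB chunks -> 160
    println(sparkCores(16))              // 16 cores, 2 reserved -> 14
  }
}
```

A partition count well above the number of cores given to Spark keeps every core busy. A count far below it leaves cores idle, which is a reason to lower the chunk size.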

Summary

Spark provides a fast, general‑purpose engine for batch, streaming, and machine‑learning workloads; MongoDB offers a flexible, low‑latency storage layer. Together they enable real‑time analytics, personalized recommendations, and large‑scale batch processing, as demonstrated by multiple industry case studies.
