Integrating Spark with MongoDB: Architecture, Use Cases, and Code Samples

This article explains how Spark can be combined with MongoDB for large‑scale data processing, covering Spark fundamentals, comparisons with HDFS, practical integration patterns, performance benefits, real‑world case studies, and detailed code examples for deployment and analytics.

Big Data Technology & Architecture

MongoDB is a distributed, document‑oriented database that is still widely used in many systems, so its integration with Spark is worth a closer look.

Spark Introduction – Spark is a general‑purpose, fast engine for large‑scale data processing, offering Spark SQL, Spark Streaming, MLlib, and support for Java, Python, Scala, and R. Because it keeps intermediate results in memory, it can run up to 100× faster than traditional MapReduce on workloads that fit in memory.

What Spark Can Do – From simple page‑view counting to complex machine‑learning‑driven personalization, Spark powers use cases such as Yahoo news recommendation, Comcast program recommendation, Uber real‑time order analysis, and Youku business intelligence.

Spark Ecosystem – Spark works with Hadoop components: HDFS for storage, and YARN, Mesos, or standalone mode for resource management. Where Hadoop offers higher‑level tools such as Hive and Pig, Spark provides Spark SQL, RDDs, and DataFrames.

HDFS vs. MongoDB – Both support horizontal scaling, but HDFS stores data as large files without indexing, while MongoDB stores fine‑grained documents with secondary indexes, supports CRUD operations, and offers millisecond‑level query latency.

Log Example – Demonstrates how logs stored as JSON documents in MongoDB can be indexed and queried efficiently compared to scanning large text files in HDFS.
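The contrast can be sketched in plain Java, with no MongoDB or HDFS involved. The class, field names, and sample entries below are hypothetical; an in‑memory hash map stands in for a secondary index, and a walk over a list stands in for scanning a large text file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a hash map plays the role of a secondary index.
class LogIndexSketch {

    // A log entry reduced to two fields; real log documents would carry more.
    static final class LogEntry {
        final String level;
        final String message;

        LogEntry(String level, String message) {
            this.level = level;
            this.message = message;
        }
    }

    // Made-up sample data.
    static List<LogEntry> sampleLogs() {
        return Arrays.asList(
            new LogEntry("INFO", "service started"),
            new LogEntry("ERROR", "connection refused"),
            new LogEntry("INFO", "request served"),
            new LogEntry("ERROR", "timeout while reading"));
    }

    // Full scan: every entry is examined, as when reading an unindexed file.
    static List<LogEntry> scanByLevel(List<LogEntry> entries, String level) {
        List<LogEntry> hits = new ArrayList<>();
        for (LogEntry entry : entries) {
            if (entry.level.equals(level)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    // Builds an index on the "level" field: field value -> matching entries.
    static Map<String, List<LogEntry>> buildLevelIndex(List<LogEntry> entries) {
        Map<String, List<LogEntry>> index = new HashMap<>();
        for (LogEntry entry : entries) {
            index.computeIfAbsent(entry.level, key -> new ArrayList<>()).add(entry);
        }
        return index;
    }

    // Indexed lookup: jumps to the matching entries without visiting the rest.
    static List<LogEntry> lookupByLevel(Map<String, List<LogEntry>> index, String level) {
        return index.getOrDefault(level, Collections.emptyList());
    }

    public static void main(String[] args) {
        List<LogEntry> logs = sampleLogs();
        Map<String, List<LogEntry>> index = buildLevelIndex(logs);
        System.out.println(scanByLevel(logs, "ERROR").size());    // prints 2
        System.out.println(lookupByLevel(index, "ERROR").size()); // prints 2
    }
}
```

Both calls return the same entries. What differs is how much data each one touches, which is the gap between an indexed query and a scan over large files.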

Spark + MongoDB Integration – Describes the workflow: the Spark driver creates tasks; executors fetch data from MongoDB through the Mongo‑Spark connector, process it, and write the results back. The connector supports predicate push‑down, so filters are evaluated inside MongoDB and only matching documents are transferred to Spark.
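The effect of predicate push‑down on data transfer can be illustrated with a small plain‑Java sketch. This is not the Mongo‑Spark connector API: the Source class, its counter, and the sample airport codes are hypothetical stand‑ins for a remote collection and the network between it and Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only: this is not the Mongo-Spark connector API.
class PushDownSketch {

    // Stand-in for a remote collection that counts documents sent over the "network".
    static final class Source {
        private final List<String> origins;
        private int transferred = 0;

        Source(List<String> origins) {
            this.origins = origins;
        }

        // No push-down: every document is shipped and the caller filters afterwards.
        List<String> fetchAll() {
            transferred += origins.size();
            return new ArrayList<>(origins);
        }

        // Push-down: the filter runs at the source, so only matches are shipped.
        List<String> fetchWhere(Predicate<String> filter) {
            List<String> matches = new ArrayList<>();
            for (String origin : origins) {
                if (filter.test(origin)) {
                    matches.add(origin);
                }
            }
            transferred += matches.size();
            return matches;
        }

        int transferred() {
            return transferred;
        }
    }

    // Made-up origin airport codes, one per flight document.
    static List<String> sampleOrigins() {
        return Arrays.asList("KMG", "PVG", "KMG", "PEK", "CAN");
    }

    // Documents transferred when filtering on the client side.
    static int transferredWithoutPushDown(String wanted) {
        Source source = new Source(sampleOrigins());
        List<String> all = source.fetchAll();
        all.removeIf(origin -> !origin.equals(wanted)); // too late to save any traffic
        return source.transferred();
    }

    // Documents transferred when the filter is evaluated at the source.
    static int transferredWithPushDown(String wanted) {
        Source source = new Source(sampleOrigins());
        source.fetchWhere(origin -> origin.equals(wanted));
        return source.transferred();
    }

    public static void main(String[] args) {
        System.out.println(transferredWithoutPushDown("KMG")); // prints 5
        System.out.println(transferredWithPushDown("KMG"));    // prints 2
    }
}
```

The filtered result is identical in both cases. Only the number of documents that leave the source changes, and that saving grows with collection size.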

Success Cases – Includes examples from Air France, Stratio (real‑time monitoring for a multinational bank), and China Eastern Airlines, highlighting how Spark + MongoDB improves latency and throughput for pricing, recommendation, and monitoring workloads.

Sample Spark Job Code

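import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;
import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

// Load the dependencies: base prices, pricing rules, and some reference data
Map dependencies = MyDependencyManager.loadDependencies();
// Broadcast the dependencies and keep the handle so that tasks can read them via broadcastDeps.value()
Broadcast<Map> broadcastDeps = javaSparkContext.broadcast(dependencies);
// Create the job RDD from MongoDB, narrowed by an aggregation pipeline
JavaMongoRDD<Document> cabinsRDD = MongoSpark.load(javaSparkContext).withPipeline(pipeline);
// For each cabin, date, and airport pair, calculate the price
// (assumes calc_price is a Function<Document, Document>); map returns a new RDD
JavaRDD<Document> pricedRDD = cabinsRDD.map(calc_price);
// Write the results to the collection set in spark.mongodb.output.uri.
// The save triggers the job; collect() is not needed and would only pull all data to the driver.
MongoSpark.save(pricedRDD);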

Installation and Connector Test Commands

curl -OL http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
mkdir -p ~/spark
tar -xvf spark-1.6.0-bin-hadoop2.6.tgz -C ~/spark --strip-components=1
cd ~/spark
./bin/spark-shell \
    --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/flights.av" \
    --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/flights.output" \
    --packages org.mongodb.spark:mongo-spark-connector_2.10:1.0.0

// Inside the Spark shell (Scala):
import com.mongodb.spark._
import org.bson.Document
MongoSpark.load(sc).take(10).foreach(println)

Simple Group‑By Statistics Example

MongoSpark.load(sc)
    .map(doc => (doc.getString("flight"), doc.getLong("seats")))
    .reduceByKey(_ + _)
    .take(10)
    .foreach(println)

Group‑By with Filter Example

MongoSpark.load(sc)
    .withPipeline(Seq(Document.parse("{ $match: { orig :  'KMG' } }")))
    .map(doc => (doc.getString("flight"), doc.getLong("seats")))
    .reduceByKey(_ + _)
    .take(10)
    .foreach(println)
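
To make the semantics of the two group‑by jobs concrete, here is a local sketch in plain Java that computes the same per‑flight seat totals on a handful of made‑up documents. No Spark or MongoDB is involved; the Flight class, the sample values, and the seatsPerFlight helper are hypothetical, and a real job would distribute this work across executors.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local re-creation of the two jobs; no Spark or MongoDB involved.
class SeatTotalsSketch {

    // A flight document reduced to the fields the examples use.
    static final class Flight {
        final String flight;
        final String orig;
        final long seats;

        Flight(String flight, String orig, long seats) {
            this.flight = flight;
            this.orig = orig;
            this.seats = seats;
        }
    }

    // Made-up sample data.
    static List<Flight> sampleFlights() {
        return Arrays.asList(
            new Flight("MU100", "KMG", 120L),
            new Flight("MU200", "PVG", 80L),
            new Flight("MU100", "KMG", 30L),
            new Flight("MU200", "KMG", 50L));
    }

    // Sums seats per flight number, like map + reduceByKey(_ + _).
    // A null origFilter keeps every document; otherwise it acts like the $match stage.
    static Map<String, Long> seatsPerFlight(List<Flight> flights, String origFilter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Flight f : flights) {
            if (origFilter == null || f.orig.equals(origFilter)) {
                totals.merge(f.flight, f.seats, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(seatsPerFlight(sampleFlights(), null));  // prints {MU100=150, MU200=130}
        System.out.println(seatsPerFlight(sampleFlights(), "KMG")); // prints {MU100=150, MU200=50}
    }
}
```

In the Spark versions the filter is handed to MongoDB as a pipeline stage, so documents with a different origin never reach the executors. In this local sketch the filter is simply applied while iterating.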

Performance Optimizations – Use appropriate chunk sizes, reserve CPU cores for the OS, and consider co‑locating Spark and MongoDB on the same machines to reduce I/O latency.

Conclusion – Spark + MongoDB can serve a wide range of scenarios: adding personalization to existing MongoDB workloads, accelerating analytics for Hadoop‑based data, and providing a fast, scalable storage layer for processed results.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
