Comprehensive Guide to Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases

This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.

Qunar Tech Salon

Sep 25, 2017

Spark has a large and growing community, and its ecosystem is essential in enterprise environments: it provides the functionality needed for production scenarios such as machine learning, log aggregation, and business intelligence.

The article introduces Spark’s core ecosystem libraries, covers the usage of ML/MLlib and Spark Streaming, and explains how Spark can serve as the core component of a data‑warehouse solution, as illustrated in Figure 1.

Spark SQL supports SQL processing on top of DataFrames, and Spark can be integrated with storage systems such as HDFS and S3. To build a Spark package with Hive support, use a command such as:

$ build/mvn -Pyarn -Phive -Phive-thriftserver \
    -Phadoop-2.6 -Dhadoop.version=2.6.0 \
    -DskipTests clean package

For Hadoop 2.7.0 with Hive 0.13, the command changes to:

$ build/mvn -Pyarn -Phive -Phive-thriftserver \
    -Phadoop-2.7 -Dhadoop.version=2.7.0 \
    -DskipTests clean package

Hive on Spark requires a Spark distribution without Hive JARs, built with:

$ ./make-distribution.sh --name hadoop2-without-hive \
    --tgz -Pyarn -Phadoop-2.6 \
    -Pparquet-provided

Deploying a Spark cluster on EC2 can be done via the spark-ec2 script:

$ ./ec2/spark-ec2 --key-pair=<your key pair name> \
    --identity-file=<your key pair path> \
    --region=us-east-1 --zone=us-east-1a \
    --hadoop-major-version=yarn \
    launch hive-on-spark-cluster

The article then covers machine‑learning concepts in Spark, including DataFrames, MLlib/ML, and external libraries and tools such as XGBoost, spark‑jobserver, and Spark Packages.

Looking ahead, the article highlights the integration of parameter servers for model‑parallel training and compares data parallelism with model parallelism.
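To make the parameter-server idea concrete, here is a minimal sketch in plain Python, with no Spark involved. It shows the data-parallel case: every worker sees the full model (a single weight) but only its own shard of the data, and a central server averages the workers' gradients. The `ParameterServer` class, the `shard_gradient` helper, and the toy data are all hypothetical names chosen for illustration and are not taken from the article. Under model parallelism, the parameters themselves would be partitioned across workers instead.

```python
from dataclasses import dataclass
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


@dataclass
class ParameterServer:
    """Hypothetical server: holds the shared weight, applies averaged gradients."""
    weight: float = 0.0
    learning_rate: float = 0.05

    def pull(self) -> float:
        # Workers fetch the current model before computing gradients.
        return self.weight

    def push(self, gradients: List[float]) -> None:
        # Synchronous update: average all workers' gradients, then take one step.
        mean_grad = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * mean_grad


def shard_gradient(weight: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error for y ~ weight * x on one data shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


def train(shards: List[List[Sample]], steps: int = 200) -> float:
    server = ParameterServer()
    for _ in range(steps):
        w = server.pull()
        # Each list element stands in for one worker processing its own shard.
        grads = [shard_gradient(w, shard) for shard in shards]
        server.push(grads)
    return server.weight


# Toy data generated from y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
learned_weight = train(shards)
print(f"learned weight: {learned_weight:.4f}")
```

In a real cluster the gradient computations would run on separate executors and the push/pull calls would go over the network; the loop here only mirrors the control flow.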

Deep learning frameworks compatible with Spark—H2O, Deeplearning4j, and SparkNet—are introduced, emphasizing Spark’s in‑memory architecture for iterative ML workloads.

Enterprise use cases are presented: using Spark and Kafka to collect user activity logs, building a real‑time recommendation system with Spark Streaming, GraphX, and MLlib, and classifying Twitter bots via streaming analytics.
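Spark Streaming processes a stream as a sequence of small batches. The log-collection use case can be pictured with the following plain-Python sketch, which groups activity events into fixed intervals and counts actions per interval. It is only an illustration of the micro-batch idea, not Spark or Kafka code: the event layout, the `micro_batches` and `count_actions` helpers, and the sample data are assumptions. It also buckets by the timestamp carried in each event, whereas a real Spark Streaming job groups records by when they arrive.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str, str]  # (timestamp_seconds, user_id, action), assumed layout


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[Event]]:
    """Assign each event to a batch index derived from its timestamp."""
    batches: Dict[int, List[Event]] = {}
    for event in events:
        batch_index = int(event[0] // interval)
        batches.setdefault(batch_index, []).append(event)
    return batches


def count_actions(batch: List[Event]) -> Counter:
    """Count how often each action occurs within one batch."""
    return Counter(action for _, _, action in batch)


# Hypothetical activity log, as it might be read from a message queue.
events = [
    (0.5, "u1", "view"),
    (1.2, "u2", "click"),
    (1.9, "u1", "view"),
    (2.1, "u3", "view"),
    (4.7, "u2", "purchase"),
]

# Two-second batch interval: one Counter of actions per batch index.
per_batch = {
    index: count_actions(batch)
    for index, batch in sorted(micro_batches(events, interval=2.0).items())
}
for index, counts in per_batch.items():
    print(index, dict(counts))
```

The per-batch counts are the kind of aggregate that a downstream recommendation or bot-classification step could consume.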

The article concludes by summarizing Spark’s extensive ecosystem and its applicability across data‑warehouse, ML, and streaming scenarios, and closes with an announcement of an upcoming React Native conference.
