Comprehensive Guide to the Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases
This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.
Spark has a large, growing community and an ecosystem that enterprise environments rely on. It covers the functions needed in production scenarios such as machine learning, log aggregation, and business intelligence.
The article introduces Spark’s core ecosystem libraries, the specific usage of ML/MLlib and Spark Streaming, and explains how Spark can serve as a core component for data‑warehouse solutions, illustrated in Figure 1.
Spark SQL supports SQL processing on DataFrames, and Spark integrates with distributed storage systems such as HDFS and S3. To build a Spark package with Hive support against Hadoop 2.6.0, use a command such as:
$ build/mvn -Pyarn -Phive -Phive-thriftserver \
-Phadoop-2.6 -Dhadoop.version=2.6.0 \
-DskipTests clean package
For Hadoop 2.7.0 with Hive 0.13, the command changes to:
$ build/mvn -Pyarn -Phive -Phive-thriftserver \
-Phadoop-2.7 -Dhadoop.version=2.7.0 \
-DskipTests clean package
Hive on Spark requires a Spark distribution without Hive JARs, built with:
$ ./make-distribution.sh --name hadoop2-without-hive \
--tgz -Pyarn -Phadoop-2.6 \
-Pparquet-provided
Deploying a Spark cluster on EC2 can be done via the spark-ec2 script:
$ ./ec2/spark-ec2 --key-pair=<your key pair name> \
--identity-file=<your key pair path> \
--region=us-east-1 --zone=us-east-1a \
--hadoop-major-version=yarn \
launch hive-on-spark-cluster
The article then covers machine-learning concepts in Spark, including DataFrames, MLlib/ML, and external libraries and tools such as XGBoost, spark-jobserver, and Spark Packages.
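For future work, the article highlights integrating parameter servers to support model-parallel training, and contrasts data parallelism (the dataset is partitioned across workers that each hold the full model) with model parallelism (the model itself is partitioned across machines).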
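The difference between the two strategies can be sketched in plain Python. This is an illustration only, not code from the article and not a Spark or parameter-server API: the linear model, the squared-error gradient, and every function name below are assumptions chosen to keep the example small. Each loop iteration stands in for one worker or one parameter shard.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features, label); hypothetical schema


def gradient_sum(weights: Sequence[float], examples: Sequence[Example]) -> Vector:
    """Sum of squared-error gradients of a linear model over `examples`."""
    grad = [0.0] * len(weights)
    for features, label in examples:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def data_parallel_gradient(weights: Sequence[float],
                           partitions: Sequence[Sequence[Example]]) -> Vector:
    """Data parallelism: every worker sees the full model but only its own partition."""
    total = [0.0] * len(weights)
    count = 0
    for part in partitions:  # one iteration stands in for one worker
        local = gradient_sum(weights, part)
        total = [t + g for t, g in zip(total, local)]
        count += len(part)
    return [t / count for t in total]  # averaged gradient, as if reduced on a driver


def model_parallel_predict(weight_shards: Sequence[Sequence[float]],
                           features: Sequence[float]) -> float:
    """Model parallelism: each shard owns a slice of the weights and returns a partial sum."""
    offset = 0
    partial_sums = []
    for shard in weight_shards:  # one iteration stands in for one parameter shard
        chunk = features[offset:offset + len(shard)]
        partial_sums.append(sum(w * x for w, x in zip(shard, chunk)))
        offset += len(shard)
    return sum(partial_sums)


data = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0), ([0.0, 1.0], 2.0)]
weights = [0.5, 0.5]

# The averaged per-partition gradients match the single-machine gradient.
full = [g / len(data) for g in gradient_sum(weights, data)]
split = data_parallel_gradient(weights, [data[:1], data[1:]])

# Splitting the weights across two shards gives the same prediction as the whole model.
sharded = model_parallel_predict([[0.5], [0.5]], [1.0, 2.0])
```

In the data-parallel case only gradients move between workers, which suits models that fit on one machine. In the model-parallel case no single shard needs the whole weight vector, which is the situation a parameter server is meant to address.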
Deep learning frameworks compatible with Spark—H2O, Deeplearning4j, and SparkNet—are introduced, emphasizing Spark’s in‑memory architecture for iterative ML workloads.
Enterprise use cases are presented: using Spark and Kafka to collect user activity logs, building a real‑time recommendation system with Spark Streaming, GraphX, and MLlib, and classifying Twitter bots via streaming analytics.
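Spark Streaming processes a stream as a sequence of small batches, and windowed aggregations combine the most recent batches. The plain-Python sketch below mimics that model for user activity logs. It is a stand-in, not the article's pipeline: the `(user_id, action)` event schema and the function names are assumptions, and Kafka ingestion is not modeled. In Spark Streaming itself, a comparable result would come from a windowed operation such as `reduceByKeyAndWindow` on a DStream.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (user_id, action); hypothetical log schema


def count_actions(batch: Iterable[Event]) -> Counter:
    """Count how often each action occurs within one micro-batch."""
    return Counter(action for _user, action in batch)


def windowed_counts(batches: Iterable[List[Event]], window: int) -> List[Dict[str, int]]:
    """For each arriving batch, emit action counts over the last `window` batches."""
    recent: Deque[Counter] = deque(maxlen=window)  # oldest batch drops out automatically
    results: List[Dict[str, int]] = []
    for batch in batches:
        recent.append(count_actions(batch))
        total: Counter = Counter()
        for counts in recent:
            total.update(counts)
        results.append(dict(total))
    return results


batches = [
    [("u1", "click"), ("u2", "view")],
    [("u1", "click")],
    [("u3", "view"), ("u3", "view")],
]
activity = windowed_counts(batches, window=2)
```

With a window of two batches, the third result no longer includes the first batch, so old activity ages out without any explicit cleanup.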
The article concludes by summarizing Spark’s extensive ecosystem, its applicability across data‑warehouse, ML, and streaming scenarios, and notes the upcoming React Native conference announcement.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
