From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms
This article analyzes the transition from a tightly coupled storage‑compute architecture to a decoupled model, detailing how Kubernetes, Kyuubi, Celeborn, Blaze, and Hue together solve resource inefficiencies, improve scalability, and boost query performance in modern big‑data environments.
In the era of rapidly evolving big data technologies, the shift from an integrated storage‑compute architecture to a decoupled model offers new opportunities for enterprises.
Initially, the monolithic storage‑compute design simplified data movement for small workloads, but as data volumes exploded, its tight coupling caused resource inflexibility and scaling challenges.
The article outlines the pain points (node failures that affect both storage and compute, resource waste between YARN and Impala, and long Hive job runtimes) and proposes a set of questions to guide the redesign.
Adopting Kubernetes provides the foundation for storage‑compute separation, offering fine‑grained scheduling, automatic scaling, and a rich ecosystem. Namespaces isolate tenants, and resource limits ensure fair sharing.
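As a rough illustration of per-tenant isolation, the sketch below builds a Kubernetes ResourceQuota manifest for one namespace. The function name, the namespace `team-etl`, and the CPU, memory, and pod figures are hypothetical and do not come from the article; only the ResourceQuota resource type and its `hard` keys are standard Kubernetes.

```python
import json


def tenant_quota(namespace: str, cpu: str, memory: str, max_pods: int) -> dict:
    """Build a ResourceQuota manifest capping one tenant namespace.

    All sizing values are illustrative assumptions, not the article's numbers.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Cap both the requests and the limits summed over all pods.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                # Cap the number of pods (Spark drivers plus executors).
                "pods": str(max_pods),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical tenant; the JSON output can be piped to `kubectl apply -f -`.
    print(json.dumps(tenant_quota("team-etl", "200", "800Gi", 500), indent=2))
```

With a quota like this in place, Spark executors launched in the tenant's namespace are rejected once the namespace total would exceed the cap, which is one way to keep sharing fair between teams.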
Kyuubi serves as a multi‑tenant SQL gateway on Spark, replacing Hive‑MR and delivering 6‑10× speedups for large queries.
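Because Kyuubi speaks the HiveServer2 protocol, clients typically reach it through a Hive-style JDBC URL. The helper below is a minimal sketch that assembles such a URL, assuming Kyuubi's default frontend port 10009 and the convention of passing session-level Spark settings after `#`. The host name and the Spark settings shown are hypothetical, and the URL syntax should be checked against the Kyuubi version in use.

```python
from typing import Dict, Optional


def kyuubi_jdbc_url(
    host: str,
    port: int = 10009,  # assumed Kyuubi default frontend port
    database: str = "default",
    spark_conf: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble a Hive-style JDBC URL for a Kyuubi gateway."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if spark_conf:
        # Session-level Spark settings are appended after '#', separated by ';'.
        pairs = ";".join(f"{key}={value}" for key, value in spark_conf.items())
        url += f";#{pairs}"
    return url


if __name__ == "__main__":
    # Hypothetical host and settings, for illustration only.
    print(kyuubi_jdbc_url(
        "kyuubi.example.internal",
        spark_conf={"spark.executor.memory": "4g"},
    ))
```

Keeping per-session settings in the URL lets each tenant tune its own Spark engine without touching the gateway's global defaults.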
Celeborn acts as an external shuffle service, enabling dynamic allocation on Kubernetes and improving Spark performance by up to three times.
Blaze, a native vectorized execution engine, accelerates Spark SQL by 20‑30% with minimal configuration.
Hue provides a user‑friendly SQL editor; deploying it on Kubernetes with a single pod simplifies access to the unified query layer.
The article's configuration snippets illustrate how to enable the Celeborn shuffle manager, Blaze, and Kyuubi in Spark, as well as a Helm‑based deployment of Hue. The Spark shuffle settings are as follows:
# Celeborn shuffle manager (use this line when Blaze is not enabled)
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
# Blaze-aware Celeborn shuffle manager (use this line instead when Blaze is enabled)
spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.celeborn.BlazeCelebornShuffleManager
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.celeborn.master.endpoints=xxxxx:9097
spark.shuffle.service.enabled=false
spark.celeborn.client.spark.shuffle.writer=hash
spark.celeborn.client.push.replicate.enabled=false
spark.sql.adaptive.localShuffleReader.enabled=false
spark.sql.adaptive.skewJoin.enabled=true
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
spark.dynamicAllocation.shuffleTracking.enabled=false
spark.celeborn.quota.identity.provider=org.apache.celeborn.common.identity.HadoopBasedIdentityProvider
Performance tests show average query latency reductions of around 25% and significant resource utilization gains.
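The listing sets `spark.shuffle.manager` twice, once for plain Celeborn and once for Blaze, and only one of the two should be active in a given job. The sketch below makes that choice explicit by building the property set from the listing and rendering it as `spark-submit` arguments. The function names and the example endpoint `celeborn-master:9097` are hypothetical; the property keys and values are copied from the listing.

```python
from typing import Dict, List

CELEBORN_MANAGER = "org.apache.spark.shuffle.celeborn.SparkShuffleManager"
BLAZE_MANAGER = (
    "org.apache.spark.sql.execution.blaze.shuffle.celeborn."
    "BlazeCelebornShuffleManager"
)


def celeborn_spark_conf(master_endpoints: str, use_blaze: bool) -> Dict[str, str]:
    """Return the Spark properties from the listing with one shuffle manager."""
    return {
        # Exactly one shuffle manager, depending on whether Blaze is enabled.
        "spark.shuffle.manager": BLAZE_MANAGER if use_blaze else CELEBORN_MANAGER,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.celeborn.master.endpoints": master_endpoints,
        "spark.shuffle.service.enabled": "false",
        "spark.celeborn.client.spark.shuffle.writer": "hash",
        "spark.celeborn.client.push.replicate.enabled": "false",
        "spark.sql.adaptive.localShuffleReader.enabled": "false",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        "spark.shuffle.sort.io.plugin.class":
            "org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO",
        "spark.dynamicAllocation.shuffleTracking.enabled": "false",
        "spark.celeborn.quota.identity.provider":
            "org.apache.celeborn.common.identity.HadoopBasedIdentityProvider",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the real Celeborn master address.
    conf = celeborn_spark_conf("celeborn-master:9097", use_blaze=True)
    print(" ".join(to_submit_args(conf)))
```

Generating the properties from one function avoids the situation where both manager lines end up in the same defaults file and the later one silently wins.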
In summary, the combined use of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue creates a cost‑effective, scalable, and high‑performance big‑data platform.
DataFunTalk