How Spark on Kubernetes Transformed Duodian DMALL’s Big Data Platform
The article details Duodian DMALL’s migration from a traditional Hadoop stack to a cloud‑native Spark‑on‑Kubernetes architecture, explaining the motivations, design choices, component selections, operational challenges, and lessons learned through concrete examples and performance observations.
Background
Duodian DMALL originally built its big‑data cluster with a classic Hadoop stack (HDFS, Hive, Spark, Flink, Yarn) managed by Cloudera CDH. The cluster ran stably, but cost and scalability limits appeared as storage and compute demand grew at different rates. Upgrading Spark from 2.3.1 and 2.4.6 to 3.x introduced API and SQL incompatibilities, so the cluster had to run multiple Spark versions side by side. CDH’s strict component version constraints prevented selective upgrades, and Yarn’s “tidal” resource pattern left Spark jobs short of resources at peak times.
Why Cloud‑Native?
To address these shortcomings, the team evaluated cloud‑native solutions and concluded that Kubernetes‑based deployment could provide a more cost‑effective, reusable, and multi‑cloud‑friendly foundation. Spark’s Kubernetes support had matured, and several open‑source tools were available to aid the transition.
Architecture Evolution
Initial Architecture
Jobs were defined in the in‑house UniData platform, parsed by TaskCenter Server, and dispatched to TaskCenter Executors that submitted Spark applications to Yarn. Hive queries were routed through Apache Kyuubi to Spark SQL. Components such as the scheduler, Hive Metastore, and Spark drivers were all deployed on physical machines, leading to complex operations and fault‑domain concerns.
Cloud‑Native Architecture
Most services and engines were migrated to Kubernetes. The scheduling system was split into Server, Executor, and Watcher modules. Executors now submit jobs as Kubernetes pods, and the Watcher tracks pod status and resource usage. Data storage moved from local disks to cloud object storage accessed via JuiceFS, eliminating the tight coupling between storage and compute. Spark shuffle was redesigned using Apache Celeborn, deployed on Kubernetes to reduce network overhead.
Key Design Decisions
Cluster Co‑location : Using Kubernetes node taints, big‑data and OLAP clusters are co‑located during low‑load periods and separated during peak hours to avoid resource contention.
Storage‑Compute Separation : Instead of porting HDFS to containers, JuiceFS provides a POSIX‑compatible layer over object storage, allowing engines to use HDFS APIs while data resides in the cloud.
Scheduler : After evaluating several Kubernetes schedulers, Volcano was chosen because it integrates with Spark 3.x, supports multiple scheduling algorithms, and offers Yarn‑like queue isolation. The team added finer‑grained pod‑affinity rules and contributed them upstream.
Job Submission : Two methods are used: native spark-submit for ad‑hoc queries (via Kyuubi), and Google's Spark Operator (spark-on-k8s-operator) for batch jobs. The operator offers richer Kubernetes feature support and unified management.
Shuffle Handling : Apache Celeborn stores shuffle data on local or block storage; when thresholds are exceeded, data is pushed to HDFS (via JuiceFS) and ultimately to object storage, enabling Dynamic Allocation to work effectively.
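The scheduler, storage, and shuffle choices above all surface as Spark configuration at submission time. The following is a minimal sketch of how a submitter might assemble those settings for a native spark-submit call. It is not the team's actual TaskCenter code. The endpoints, the metadata URL, the PodGroup template path, and the helper name build_spark_submit_args are hypothetical. The configuration keys themselves come from the Spark, Celeborn, and JuiceFS documentation, and some Spark or Celeborn versions need additional settings for Dynamic Allocation.

```python
from typing import Dict, List

# Hypothetical endpoints and paths; replace with values from your own cluster.
K8S_MASTER = "k8s://https://kubernetes.default.svc:443"
CELEBORN_MASTER = "celeborn-master-0.celeborn-master-svc:9097"
JUICEFS_META = "redis://juicefs-redis:6379/1"
PODGROUP_TEMPLATE = "/opt/spark/conf/podgroup-template.yaml"


def build_spark_submit_args(app_name: str, main_file: str) -> List[str]:
    """Assemble spark-submit arguments for a cluster-mode job on Kubernetes."""
    volcano_step = "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"
    conf: Dict[str, str] = {
        # Hand pod scheduling to Volcano (supported since Spark 3.3).
        "spark.kubernetes.scheduler.name": "volcano",
        "spark.kubernetes.driver.pod.featureSteps": volcano_step,
        "spark.kubernetes.executor.pod.featureSteps": volcano_step,
        # The PodGroup template is where the Volcano queue is declared.
        "spark.kubernetes.scheduler.volcano.podGroupTemplateFile": PODGROUP_TEMPLATE,
        # Send shuffle data to Celeborn instead of executor-local disks.
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        "spark.celeborn.master.endpoints": CELEBORN_MASTER,
        "spark.shuffle.service.enabled": "false",
        # Executors can be released because shuffle data lives in Celeborn.
        # Check the Celeborn docs for version-specific companion settings.
        "spark.dynamicAllocation.enabled": "true",
        # Expose JuiceFS through the Hadoop FileSystem API (jfs:// scheme).
        "spark.hadoop.fs.jfs.impl": "io.juicefs.JuiceFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.jfs.impl": "io.juicefs.JuiceFS",
        "spark.hadoop.juicefs.meta": JUICEFS_META,
    }
    args = ["spark-submit", "--master", K8S_MASTER,
            "--deploy-mode", "cluster", "--name", app_name]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(main_file)
    return args


if __name__ == "__main__":
    print(" ".join(build_spark_submit_args("daily-etl", "local:///opt/app/job.py")))
```

Keeping the settings in one dictionary makes it straightforward to override individual keys per job while the platform supplies the defaults.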
Monitoring & Operations
Prometheus + Grafana provides a top‑down view of node and namespace resource consumption, helping administrators detect bottlenecks and optimize night‑time batch workloads. Spark application progress is observed through Spark History Server logs stored in object storage, with UI access proxied through TaskCenter to avoid exposing public endpoints. TaskCenter Watcher continuously collects resource metrics and can trigger automated remediation.
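As a rough illustration of the kind of aggregation a watcher component could perform, the sketch below sums CPU requests of running pods per namespace and flags namespaces that exceed a budget. It is an assumption-laden example, not TaskCenter Watcher's real logic. The pod dictionary shape, the budget table, and all function names are hypothetical, and a real watcher would read pod data from the Kubernetes API rather than from in-memory dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)


def cpu_requested_by_namespace(pods: Iterable[dict]) -> Dict[str, float]:
    """Sum the CPU requests of Running pods, grouped by namespace."""
    totals: Dict[str, float] = defaultdict(float)
    for pod in pods:
        if pod["phase"] != "Running":
            continue  # finished or pending pods hold no CPU
        totals[pod["namespace"]] += sum(parse_cpu(q) for q in pod["cpu_requests"])
    return dict(totals)


def namespaces_over_budget(totals: Dict[str, float],
                           budgets: Dict[str, float]) -> List[str]:
    """Return namespaces whose requested cores exceed their budget."""
    return sorted(ns for ns, used in totals.items()
                  if used > budgets.get(ns, float("inf")))


if __name__ == "__main__":
    # Hypothetical sample data in a simplified shape.
    sample = [
        {"namespace": "spark", "phase": "Running", "cpu_requests": ["2", "500m"]},
        {"namespace": "spark", "phase": "Succeeded", "cpu_requests": ["4"]},
        {"namespace": "olap", "phase": "Running", "cpu_requests": ["1"]},
    ]
    totals = cpu_requested_by_namespace(sample)
    print(totals, namespaces_over_budget(totals, {"spark": 2.0, "olap": 4.0}))
```

The list returned by namespaces_over_budget is where an automated remediation step, such as throttling new submissions, could hook in.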
Log collection uses Fluent Bit to tail container logs, forward them to Kafka, and then asynchronously write to Elasticsearch for downstream analysis. To avoid excessive log size, the Fluent Bit configuration enables Skip_Long_Lines = On.
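The tail-to-Kafka half of that pipeline can be sketched by rendering a Fluent Bit configuration in its classic format. This is an illustrative sketch, not the team's actual configuration. The log path, broker address, topic name, and buffer limit are assumptions. The plugin names (tail, kafka) and the keys Buffer_Max_Size, Skip_Long_Lines, Brokers, and Topics are standard Fluent Bit options. With Skip_Long_Lines set to On, a line that exceeds Buffer_Max_Size is skipped and the file continues to be monitored.

```python
from typing import Dict


def render_section(name: str, options: Dict[str, str]) -> str:
    """Render one classic-format Fluent Bit section such as [INPUT]."""
    lines = [f"[{name}]"]
    for key, value in options.items():
        lines.append(f"    {key} {value}")
    return "\n".join(lines)


def render_fluent_bit_config(log_path: str, brokers: str, topic: str,
                             max_line: str = "1MB") -> str:
    """Build a tail -> kafka pipeline that skips oversized log lines."""
    tail_input = {
        "Name": "tail",
        "Path": log_path,
        # Lines longer than this buffer are dropped rather than forwarded.
        "Buffer_Max_Size": max_line,
        "Skip_Long_Lines": "On",
    }
    kafka_output = {
        "Name": "kafka",
        "Match": "*",
        "Brokers": brokers,
        "Topics": topic,
    }
    return (render_section("INPUT", tail_input) + "\n\n"
            + render_section("OUTPUT", kafka_output) + "\n")


if __name__ == "__main__":
    # Hypothetical path, broker, and topic.
    print(render_fluent_bit_config("/var/log/containers/*.log",
                                   "kafka:9092", "spark-logs"))
```

The Kafka-to-Elasticsearch step is a separate asynchronous consumer and is left out of this sketch.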
Elastic scaling is achieved by labeling fixed and elastic nodes differently and adjusting pod‑affinity rules at runtime, allowing the cluster to spin up elastic nodes only when fixed resources are exhausted.
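One way to express "fixed nodes first, elastic nodes only when needed" is a Kubernetes node-affinity block generated per job. The sketch below assumes the affinity rules in question map to nodeAffinity on the pod spec. The label key node-pool, its values, and the function name are hypothetical. The field names follow the Kubernetes pod specification.

```python
from typing import Dict, List


def elastic_aware_affinity(label_key: str, fixed_value: str,
                           elastic_value: str, allow_elastic: bool) -> Dict:
    """Build a pod affinity block that prefers fixed nodes.

    When allow_elastic is False the pod may only land on fixed nodes.
    When it is True, elastic nodes become eligible but fixed nodes are
    still preferred by the scheduler.
    """
    allowed: List[str] = [fixed_value]
    if allow_elastic:
        allowed.append(elastic_value)

    def expression(values: List[str]) -> Dict:
        return {"matchExpressions": [
            {"key": label_key, "operator": "In", "values": values}
        ]}

    return {
        "nodeAffinity": {
            # Hard constraint: which node pools are eligible at all.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [expression(allowed)]
            },
            # Soft constraint: favor fixed nodes when they have room.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "preference": expression([fixed_value])}
            ],
        }
    }


if __name__ == "__main__":
    # Hypothetical label key and values.
    print(elastic_aware_affinity("node-pool", "fixed", "elastic", allow_elastic=True))
```

A platform could flip allow_elastic at submission time based on how much fixed capacity is still free.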
Pitfalls and Resolutions
.stop() and driver termination : Driver pods of early Spark‑on‑Kubernetes jobs did not exit after the job finished; the cause was a Spark bug fixed in version 3.3.x.
Spark UI Service IP : Google's Spark Operator originally allocated external Service IPs for the Spark UI, which conflicted with security policies. The team switched to a headless ClusterIP service (clusterIP: None) to keep UI access internal.
Driver pod cleanup : spark-submit leaves completed driver pods in the cluster. A Kubernetes CronJob was added to delete finished pods and reclaim resources.
Custom spark.app.name values : Using non‑ASCII names caused Kubernetes naming violations. The platform now injects spark.kubernetes.executor.podNamePrefix to enforce valid names.
zstd compression errors : Switching shuffle compression from LZ4 to zstd reduced data size but caused java.io.IOException: Decompression error: Corrupted block detected when broadcasts were present. Updating the zstd‑jni dependency from 1.5.2‑1 to 1.5.4‑1 resolved the issue.
Long log lines : Extremely large log entries (>5 MB) caused downstream timeouts. Enabling Skip_Long_Lines in Fluent Bit mitigated the problem.
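The spark.app.name pitfall above comes down to deriving a Kubernetes-safe value before it is passed as spark.kubernetes.executor.podNamePrefix. The sketch below shows one possible approach and is not the platform's actual implementation. The function name, the hash suffix, and the 47-character default limit are assumptions, so check the limit against the Spark version in use. The output follows the DNS label rules: lowercase letters, digits, and hyphens, starting and ending with an alphanumeric character.

```python
import hashlib
import re


def to_pod_name_prefix(app_name: str, max_len: int = 47) -> str:
    """Derive a DNS-label-safe pod name prefix from an arbitrary app name."""
    # Replace every run of characters outside [a-z0-9] with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", app_name.lower()).strip("-")
    # A short hash keeps distinct names distinct even after characters
    # (for example non-ASCII ones) have been stripped.
    digest = hashlib.sha1(app_name.encode("utf-8")).hexdigest()[:8]
    base = slug[: max_len - len(digest) - 1].strip("-")
    return f"{base}-{digest}" if base else f"spark-{digest}"


if __name__ == "__main__":
    for name in ["每日订单 ETL", "报表", "Daily_Orders.v2"]:
        print(name, "->", to_pod_name_prefix(name))
```

Names made up entirely of non-ASCII characters fall back to a spark- prefix plus the hash, so they remain unique and valid.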
Future Outlook
The new architecture now runs stably across multiple public clouds, improving cluster provisioning speed and Spark performance. Planned next steps include decommissioning the legacy Hadoop‑based clusters, adopting Apache Iceberg for lake‑house capabilities, and evaluating the vectorized execution engine Blaze for further throughput gains.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
