How Compute‑Storage Separation Cuts Costs and Boosts Performance for Big Data on Kubernetes
This article examines the challenges of big‑data storage in containerized environments, compares compute‑storage‑separated architectures with traditional setups, presents performance and cost benchmarks of Alibaba Cloud ECS instances, and outlines practical storage options such as OSS, NAS, and DFS for Spark workloads on Kubernetes.
In this follow-up to the Spark Operator overview, the author turns to the most critical issue for big-data platforms, storage, and the three requirements it must meet: low cost, high capacity, and fast read/write performance.
Why Compute‑Storage Separation Matters
Hardware trends show network bandwidth growing faster than disk speed, which erodes the advantage of reading from local disks. Separating compute from storage reduces wasted CPU cycles, lowers storage expenses through centralized services, and lets each tier be provisioned for its own SLA: storage needs high availability, while compute tasks can simply be retried.
Performance and Cost Comparison of Alibaba Cloud ECS Instances
Two instance types were benchmarked with HiBench under identical I/O conditions:
ecs.ebmhfg5.2xlarge (8C 32G 6Gbps) – higher CPU frequency.
ecs.d1ne.2xlarge (8C 32G 6Gbps) – standard configuration.
Results (images omitted for brevity) show that ecs.ebmhfg5.2xlarge delivers roughly 30% better compute performance while costing about 25% less than ecs.d1ne.2xlarge, demonstrating that selecting compute‑optimized instances and off‑loading storage to remote services can yield both speed and savings.
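To see what those two figures mean together, the rough arithmetic below folds them into a single price-performance ratio. It is only a sketch: the 1.30 and 0.75 inputs are the approximate numbers quoted above, normalized against ecs.d1ne.2xlarge, and the object and method names are illustrative rather than part of the original benchmark.

```scala
object PricePerformance {
  /** Work done per unit of cost, relative to a baseline where both values are 1.0. */
  def ratio(relativePerformance: Double, relativeCost: Double): Double = {
    require(relativeCost > 0, "relativeCost must be positive")
    relativePerformance / relativeCost
  }

  def main(args: Array[String]): Unit = {
    // Assumed inputs, taken from the approximate figures in the text:
    // about 30% faster and about 25% cheaper than the ecs.d1ne.2xlarge baseline.
    val r = ratio(relativePerformance = 1.30, relativeCost = 0.75)
    println(f"Relative price-performance: $r%.2fx")
  }
}
```

Under these assumptions the compute-optimized instance completes roughly 1.7 times as much benchmark work per unit of spend, which is why the two modest-looking percentages compound into a noticeable saving.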
Storage Options for Containerized Big Data
When compute and storage are decoupled, various cloud storage services can be used:
OSS (Object Storage Service) – ideal for massive small‑file workloads; can be accessed via a mounted filesystem or directly through the Spark SDK. Sample Scala code shows how to read an OSS path with Spark.
DFS (Alibaba Cloud's HDFS-compatible file storage service) – provides elastically scalable capacity, high performance, and HDFS-compatible APIs for existing Hadoop jobs. A simple Spark application reads a file via a dfs:// URI.
NAS (Network Attached Storage) – offers POSIX-compatible access for scenarios where SDK integration is inconvenient. Users create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) in the container service console, then reference the PVC in the SparkApplication spec.
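From the job's point of view, the three options differ mainly in the path handed to Spark. The sketch below makes that concrete with a small helper that builds the input path for each backend. The `StoragePaths` object, its case classes, and the `oss://bucket/path` form are assumptions for illustration; the exact OSS URI format depends on the connector in use, and the NAS case assumes the volume is already mounted into the driver and executor pods.

```scala
object StoragePaths {
  sealed trait Storage
  final case class Oss(bucket: String) extends Storage            // object storage bucket
  final case class Dfs(host: String, port: Int) extends Storage   // HDFS-compatible endpoint
  final case class NasMount(mountPath: String) extends Storage    // absolute mount path in the pod

  /** Builds the path string a Spark job would pass to sc.textFile or spark.read. */
  def inputPath(storage: Storage, relativePath: String): String = {
    val rel = relativePath.stripPrefix("/")
    storage match {
      case Oss(bucket)         => s"oss://$bucket/$rel"
      case Dfs(host, port)     => s"dfs://$host:$port/$rel"
      case NasMount(mountPath) => s"file://${mountPath.stripSuffix("/")}/$rel"
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical bucket name and mount path; only the scheme changes per backend.
    println(inputPath(Oss("my-bucket"), "logdata/ab.log"))
    println(inputPath(NasMount("/tmp"), "logdata/ab.log"))
  }
}
```

Keeping the path construction in one place means the processing code itself does not change when a workload moves from one storage service to another.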
OSS Sample Code
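package com.aliyun.emr.example

// RunLocally (assumed to live in the same example package) supplies the SparkContext `sc`.
object OSSSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("""Usage: bin/spark-submit --class OSSSample ...""")
      System.exit(1)
    }

    val inputPath = args(0)             // OSS path to read
    val numPartitions = args(1).toInt   // minimum number of partitions
    val ossData = sc.textFile(inputPath, numPartitions)

    // Note: top(10) returns the 10 largest lines in string order,
    // not the first 10 lines of the file (use take(10) for that).
    println("The top 10 lines are:")
    ossData.top(10).foreach(println)
  }

  override def getAppName: String = "OSS Sample"
}
DFS Sample Code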
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // HDFS-compatible DFS endpoint; replace with your own file system address
    val logFile = "dfs://f-5d68cc61ya36.cn-beijing.dfs.aliyuncs.com:10290/logdata/ab.log"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()

    // Cache the dataset because it is scanned twice below
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")

    spark.stop()
  }
}
Configuring NAS for Spark on Kubernetes
To use NAS, create a PV and PVC in the Alibaba Cloud Container Service console, then reference the claim in the SparkApplication manifest:
apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
    - name: pvc-nas
      persistentVolumeClaim:
        claimName: pvc-nas
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
Dynamic provisioning can also be used for Kubernetes-native storage.
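The manifest above assumes that a claim named pvc-nas already exists in the default namespace. For readers who prefer manifests to the console, the following is a minimal sketch of a statically provisioned PV and PVC, assuming the NAS file system is reachable over NFS and using the generic Kubernetes nfs volume source. The PV name pv-nas, the 5Gi size, and the server placeholder are hypothetical; Alibaba Cloud's own volume plugin may be the better choice in a real cluster.

```yaml
# Hypothetical static PersistentVolume backed by a NAS file system over NFS.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nas                          # hypothetical name
spec:
  capacity:
    storage: 5Gi                        # placeholder size; the field is required by the API
  accessModes:
    - ReadWriteMany                     # driver and executors mount the volume together
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""                  # no storage class: bind statically
  nfs:
    server: "<your-nas-mount-target>"   # placeholder: replace with the NAS mount target address
    path: "/"
---
# Claim referenced as claimName: pvc-nas in the SparkApplication manifest.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nas
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""                  # prevents the default class from provisioning dynamically
  volumeName: pv-nas                    # bind explicitly to the PV above
  resources:
    requests:
      storage: 5Gi
```

Setting storageClassName to an empty string on both objects keeps the cluster's default storage class from provisioning a new volume, so the claim binds to the pre-created PV.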
Conclusion
By separating compute and storage, big‑data workloads on Kubernetes can achieve lower total cost of ownership, higher performance, and greater flexibility in choosing the most suitable storage service (OSS for cheap, high‑throughput object storage; DFS for HDFS‑compatible high‑IO workloads; NAS for POSIX‑style access). This approach enables truly cloud‑native, elastic big‑data processing.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
