
How Compute‑Storage Separation Cuts Costs and Boosts Performance for Big Data on Kubernetes

This article examines the challenges of big‑data storage in containerized environments, compares compute‑storage‑separated architectures with traditional setups, presents performance and cost benchmarks of Alibaba Cloud ECS instances, and outlines practical storage options such as OSS, NAS, and DFS for Spark workloads on Kubernetes.

Alibaba Cloud Native

Apr 9, 2019


Following up on the Spark Operator overview, this article turns to the most critical issue for big data platforms, storage, and the three requirements it has to meet: low cost, high capacity, and fast read/write performance.

Why Compute‑Storage Separation Matters

Network bandwidth has been increasing faster than disk speed, so reading data from a local disk no longer offers much advantage over reading it across the network. Separating compute from storage lets each tier scale on its own, which reduces wasted CPU cycles on nodes that were added only for their disks. It also lowers storage expenses through centralized services and lets each tier be operated to its own SLA: storage must be highly available, whereas compute tasks can simply be retried.

Performance and Cost Comparison of Alibaba Cloud ECS Instances

Two instance types were benchmarked with HiBench under identical I/O conditions:

ecs.ebmhfg5.2xlarge (8C 32G 6Gbps) – higher CPU frequency.

ecs.d1ne.2xlarge (8C 32G 6Gbps) – standard configuration.

Results (images omitted for brevity) show that ecs.ebmhfg5.2xlarge delivers roughly 30% better compute performance while costing about 25% less than ecs.d1ne.2xlarge, demonstrating that selecting compute‑optimized instances and off‑loading storage to remote services can yield both speed and savings.
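
The two percentages combine into a single price-performance figure. The sketch below is only an illustration of that arithmetic: it assumes "30% better" means 1.30x throughput and "25% less" means 0.75x price, with ecs.d1ne.2xlarge as the 1.0 baseline. The `Instance` type and its field names are hypothetical, not part of any benchmark tooling.

```scala
// Relative price-performance of the two instance types, using the article's
// approximate figures. Baseline: ecs.d1ne.2xlarge = 1.0 for both metrics.
object PricePerformance {
  final case class Instance(name: String, relThroughput: Double, relPrice: Double) {
    // Cost of finishing one unit of work, relative to the baseline.
    def costPerUnitWork: Double = relPrice / relThroughput
  }

  val baseline: Instance = Instance("ecs.d1ne.2xlarge", relThroughput = 1.0, relPrice = 1.0)
  // Assumption: 30% better performance = 1.30x throughput; 25% cheaper = 0.75x price.
  val candidate: Instance = Instance("ecs.ebmhfg5.2xlarge", relThroughput = 1.30, relPrice = 0.75)

  // Percentage by which `other` lowers the cost per unit of work compared to `base`.
  def savingsPercent(base: Instance, other: Instance): Double =
    (1.0 - other.costPerUnitWork / base.costPerUnitWork) * 100.0

  def main(args: Array[String]): Unit = {
    val saved = savingsPercent(baseline, candidate)
    println(f"${candidate.name}: $saved%.1f%% lower cost per unit of work")
  }
}
```

Under these assumptions the cost per unit of work falls by roughly 42%, which is larger than either headline number on its own because the speedup and the price cut multiply.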

Storage Options for Containerized Big Data

When compute and storage are decoupled, various cloud storage services can be used:

OSS (Object Storage Service) – ideal for massive small‑file workloads; can be accessed via a mounted filesystem or directly through the Spark SDK. Sample Scala code shows how to read an OSS path with Spark.

DFS (Alibaba Cloud's HDFS‑compatible file storage service) – provides unlimited capacity, high performance, and HDFS‑compatible APIs, so existing Hadoop jobs can use it without changes to their file access code. A simple Spark application reads a file via a dfs:// URI.

NAS (Network Attached Storage) – offers POSIX‑compatible access for scenarios where SDK integration is inconvenient. Users create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) in the container service console, then reference the PVC in the Spark Operator's SparkApplication spec.

OSS Sample Code

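package com.aliyun.emr.example

// RunLocally is a helper trait from the Alibaba Cloud EMR example project,
// assumed here to live in this package. It supplies the SparkContext `sc`
// and requires getAppName.
object OSSSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("""Usage: bin/spark-submit --class OSSSample ...""")
      System.exit(1)
    }
    val inputPath = args(0)            // OSS path to read
    val numPartitions = args(1).toInt  // minimum number of partitions
    val ossData = sc.textFile(inputPath, numPartitions)
    // top(10) returns the 10 largest lines by string ordering,
    // not the first 10 lines of the file.
    println("The top 10 lines are:")
    ossData.top(10).foreach(println)
  }
  override def getAppName: String = "OSS Sample"
}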
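
The sample above receives its SparkContext from RunLocally, so the OSS connection settings are not visible in it. As a rough sketch of what such settings can look like, the block below builds them for the open-source hadoop-aliyun connector, which is a swapped-in alternative and not necessarily what the EMR example uses. It assumes that connector is on the classpath; the `OssSparkSettings` object, the `ossSettings` helper, the environment variable names, and the endpoint value are all illustrative placeholders.

```scala
// Builds Spark settings for reading oss:// paths through the hadoop-aliyun
// connector. Pure Scala: no Spark dependency is needed to run this sketch.
object OssSparkSettings {
  // FileSystem implementation shipped with the hadoop-aliyun module.
  val OssFileSystemClass = "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"

  // The "spark.hadoop." prefix makes Spark copy each property into the
  // Hadoop Configuration used by its file system clients.
  def ossSettings(endpoint: String,
                  accessKeyId: String,
                  accessKeySecret: String): Map[String, String] =
    Map(
      "spark.hadoop.fs.oss.impl" -> OssFileSystemClass,
      "spark.hadoop.fs.oss.endpoint" -> endpoint,
      "spark.hadoop.fs.oss.accessKeyId" -> accessKeyId,
      "spark.hadoop.fs.oss.accessKeySecret" -> accessKeySecret
    )

  def main(args: Array[String]): Unit = {
    val settings = ossSettings(
      endpoint = "oss-cn-beijing.aliyuncs.com", // placeholder region endpoint
      accessKeyId = sys.env.getOrElse("OSS_ACCESS_KEY_ID", "<access-key-id>"),
      accessKeySecret = sys.env.getOrElse("OSS_ACCESS_KEY_SECRET", "<access-key-secret>")
    )
    // Print the settings as spark-submit flags, masking the secret.
    settings.toSeq.sortBy(_._1).foreach { case (key, value) =>
      val shown = if (key.endsWith("accessKeySecret")) "***" else value
      println(s"--conf $key=$shown")
    }
  }
}
```

Each pair can be passed on the command line as `--conf key=value` or set in code with `SparkSession.builder.config(key, value)`. In a cluster, credentials are better injected from a Kubernetes Secret than written into the job.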

DFS Sample Code

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // The dfs:// scheme needs a DFS-aware Hadoop FileSystem implementation on the classpath.
    val logFile = "dfs://f-5d68cc61ya36.cn-beijing.dfs.aliyuncs.com:10290/logdata/ab.log"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Cache the dataset because it is scanned twice below.
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}

Configuring NAS for Spark on Kubernetes

To use NAS, create a PV and PVC in the Alibaba Cloud Container Service console, then reference the claim in the SparkApplication manifest:

apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
  - name: pvc-nas
    persistentVolumeClaim:
      claimName: pvc-nas
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    volumeMounts:
    - name: "pvc-nas"
      mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    volumeMounts:
    - name: "pvc-nas"
      mountPath: "/tmp"
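
Because NAS is exposed as a POSIX file system, code inside the driver or executor pods can use ordinary file APIs on the mount path, with no storage SDK involved. The following is a minimal sketch of a write-and-read-back check; `NasMountCheck`, `roundTrip`, and the marker file name are hypothetical, and the default path `/tmp` is simply the `mountPath` from the manifest above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Verifies that a mounted volume behaves like a normal directory by writing
// a small file, reading it back, and cleaning up afterwards.
object NasMountCheck {
  // Returns true when the payload read back equals the payload written.
  def roundTrip(mountPath: Path, fileName: String, payload: String): Boolean = {
    val target = mountPath.resolve(fileName)
    Files.write(target, payload.getBytes(StandardCharsets.UTF_8))
    val readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8)
    Files.delete(target)
    readBack == payload
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the mountPath used in the manifest; pass another path to override.
    val mount = Paths.get(args.headOption.getOrElse("/tmp"))
    val ok = roundTrip(mount, "nas-mount-check.txt", "hello from a Spark pod")
    println(s"Round trip through $mount succeeded: $ok")
  }
}
```

The same property is what lets a Spark job address files on the volume with plain `file://` paths, provided the claim is mounted at the same location in the driver and in every executor.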

Instead of creating the PV by hand, the volume can also be provisioned dynamically through a Kubernetes StorageClass.

Conclusion

By separating compute and storage, big‑data workloads on Kubernetes can achieve lower total cost of ownership, higher performance, and greater flexibility in choosing the most suitable storage service (OSS for cheap, high‑throughput object storage; DFS for HDFS‑compatible high‑IO workloads; NAS for POSIX‑style access). This approach enables truly cloud‑native, elastic big‑data processing.
