
Containerizing Stateful Services on Kubernetes: Challenges, Solutions, and Best Practices

This article examines the difficulties of running stateful services such as Redis, etcd, and MySQL on Kubernetes and presents practical solutions—including workload selection, CRD/operator extensions, scheduling strategies, high‑availability mechanisms, performance‑optimized networking and storage, and chaos‑engineering validation—to achieve reliable, high‑performance containerized deployments.

Sohu Tech Products

Background

As Kubernetes becomes the dominant cloud‑native platform, many traditional services are being migrated from VMs or physical machines to Kubernetes. While most Kubernetes deployments are stateless, stateful services pose additional challenges such as resource isolation, fine‑grained access control, and performance requirements. This article uses a Redis cluster (Codis) as a concrete example to explore common pain points and solutions for containerizing stateful workloads.

Stateful Service Containerization Challenges

The article outlines the key questions to address when containerizing stateful services: describing the service in Kubernetes terms, choosing the appropriate workload, extending Kubernetes when built‑in workloads are insufficient, performing safe updates, ensuring pods are scheduled across failure domains, handling pod failures, and meeting high network and storage performance demands.

Workload Types

Kubernetes provides several built‑in workloads, including Pod, Deployment, and StatefulSet. A Pod is the smallest schedulable unit and can host sidecar containers for auxiliary tasks. Deployments suit stateless components, but their pods have no stable identity and are not updated in a fixed order, which makes them a poor fit for stateful services. StatefulSet offers stable network identities, per‑pod persistent storage, and ordered rolling updates, which are essential for services like etcd, ZooKeeper, and Redis.
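As a minimal sketch of what those StatefulSet guarantees look like in a manifest, the example below pairs a headless Service with a three‑replica StatefulSet. The names (`redis`, `data`), the image tag, and the volume size are illustrative assumptions rather than values from the original article; only the `cbs` storage class name is taken from the storage example later in this article.

```yaml
# Headless Service: gives each pod a stable DNS name such as
# redis-0.redis.<namespace>.svc.cluster.local (names are hypothetical).
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  clusterIP: None            # headless: per-pod DNS records, no virtual IP
  selector:
    app: redis
  ports:
  - name: redis
    port: 6379
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis         # must reference the headless Service above
  replicas: 3                # pods are named redis-0, redis-1, redis-2
  selector:
    matchLabels:
      app: redis
  updateStrategy:
    type: RollingUpdate      # one pod at a time, highest ordinal first
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7       # illustrative image tag
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:      # one PVC per pod: data-redis-0, data-redis-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: cbs  # class name reused from this article's storage example
      resources:
        requests:
          storage: 10Gi      # illustrative size
```

Each pod keeps its ordinal name and its own PVC across rescheduling, which is the property a Deployment cannot offer.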

Extension Mechanisms

Kubernetes offers a rich extension ecosystem, including CRDs, the Aggregated API Server, custom schedulers, and operators. When built‑in workloads cannot meet specific requirements, developers can define custom resources through CRDs and implement controllers (operators) to manage the lifecycle of complex stateful applications.

Enhanced Workloads

Enhanced workloads such as Tencent's StatefulSetPlus, tkestack TAPP, and the open‑source OpenKruise (which provides CloneSet, Advanced StatefulSet, SidecarSet, etc.) add features like in‑place updates, fixed IPs, HPA support, and more granular rollout control.
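To make the in‑place update feature concrete, here is a hedged sketch of an OpenKruise Advanced StatefulSet, modeled on the fields shown in the OpenKruise documentation. It assumes OpenKruise is installed in the cluster; the workload name, labels, and image are hypothetical, and field names should be checked against the installed OpenKruise version.

```yaml
apiVersion: apps.kruise.io/v1beta1   # OpenKruise API group, not apps/v1
kind: StatefulSet
metadata:
  name: codis-server                 # hypothetical name
spec:
  replicas: 3
  serviceName: codis-server
  selector:
    matchLabels:
      app: codis-server
  template:
    metadata:
      labels:
        app: codis-server
    spec:
      readinessGates:
      # Keeps the pod NotReady while an in-place update is in progress
      - conditionType: InPlaceUpdateReady
      containers:
      - name: codis-server
        image: example.com/codis-server:v1   # placeholder image
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Update the container image without recreating the pod when possible
      podUpdatePolicy: InPlaceIfPossible
      # Only pods with ordinal >= partition are updated (gray release)
      partition: 2
```

With this policy an image change restarts the container inside the existing pod, so the pod keeps its IP and its node placement.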

Operator‑Based Extension

By defining a CRD that represents a complete Codis cluster, an operator can watch for create/update/delete events and reconcile the desired state by creating the necessary Deployments, StatefulSets, and other components. The operator follows the typical controller pattern: List → Watch → Queue → Reconcile.
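The CRD side of this pattern can be sketched as follows. This is an assumed schema, not the one used by the original authors: the API group `example.com`, the kind `CodisCluster`, and the spec fields `proxyReplicas`, `serverGroups`, and `replicasPerGroup` are all hypothetical names chosen for illustration.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: codisclusters.example.com    # must be <plural>.<group>
spec:
  group: example.com                 # hypothetical API group
  scope: Namespaced
  names:
    kind: CodisCluster
    plural: codisclusters
    singular: codiscluster
  versions:
  - name: v1alpha1
    served: true
    storage: true
    subresources:
      status: {}                     # lets the operator report state separately from spec
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required: ["proxyReplicas", "serverGroups"]
            properties:
              proxyReplicas:         # desired number of proxy pods
                type: integer
                minimum: 1
              serverGroups:          # desired number of master/replica groups
                type: integer
                minimum: 1
              replicasPerGroup:      # replicas behind each master
                type: integer
                minimum: 1
          status:
            type: object
            properties:
              phase:
                type: string
---
# A custom resource the operator would reconcile into Deployments and StatefulSets
apiVersion: example.com/v1alpha1
kind: CodisCluster
metadata:
  name: codis-test
spec:
  proxyReplicas: 2
  serverGroups: 3
  replicasPerGroup: 2
```

The CRD needs to be registered before the custom resource is created, so the two documents are typically applied in separate steps.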

Scheduling

Ensuring that equivalent pods (e.g., master‑replica pairs) are spread across failure domains is achieved using Kubernetes affinity and anti‑affinity rules. The article provides an example anti‑affinity configuration for an etcd cluster:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: etcd_cluster
          operator: In
          values: ["etcd-test"]
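      # topology.kubernetes.io/zone replaces the deprecated failure-domain.beta.kubernetes.io/zone
      # label; on clusters older than v1.17, use the deprecated label instead.
      topologyKey: topology.kubernetes.io/zone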

When built‑in scheduling is insufficient, custom predicates, priorities, or entirely separate schedulers can be employed.
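As a rough illustration of handing a pod to a separate scheduler, the pod spec below sets `schedulerName` and adds a soft (preferred) anti‑affinity rule that spreads pods of one group across nodes. The scheduler name, pod name, label, and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: codis-server-0               # hypothetical pod name
  labels:
    codis_group: group-1             # hypothetical label shared by a master and its replicas
spec:
  # Hypothetical custom scheduler; omit this field to use default-scheduler.
  schedulerName: my-custom-scheduler
  affinity:
    podAntiAffinity:
      # Soft rule: prefer a different node, but still schedule if none is available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              codis_group: group-1
          topologyKey: kubernetes.io/hostname
  containers:
  - name: codis-server
    image: example.com/codis-server:latest   # placeholder image
```

If no scheduler with the given name is running, the pod stays Pending, so the custom scheduler must be deployed first.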

High Availability

Stateful services require robust HA mechanisms. The article discusses three replication models: master‑slave replication (synchronous, asynchronous, semi‑synchronous), decentralized replication (quorum reads/writes), and consensus algorithms such as Raft/Paxos. It emphasizes that container‑level HA (e.g., pod self‑healing) must be complemented by service‑level HA logic to avoid data loss or split‑brain scenarios.
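The quorum and consensus models can be summarized with a few standard inequalities. The notation is introduced here for illustration: $N$ is the number of replicas, $W$ and $R$ are the write and read quorum sizes, $Q_{\text{majority}}$ is the majority needed by Raft or Paxos, and $f_{\max}$ is the number of replica failures the cluster tolerates.

```latex
\begin{align}
R + W &> N
  && \text{(every read quorum overlaps every write quorum)} \\
W &> \frac{N}{2}
  && \text{(two conflicting writes cannot both reach a quorum)} \\
Q_{\text{majority}} &= \left\lfloor \frac{N}{2} \right\rfloor + 1,
  \qquad
  f_{\max} = N - Q_{\text{majority}} = \left\lfloor \frac{N-1}{2} \right\rfloor
\end{align}
```

For example, a three‑member etcd cluster has a majority of 2 and tolerates 1 failure, while a five‑member cluster has a majority of 3 and tolerates 2. This is why restarting a failed pod is not sufficient on its own: if more than $f_{\max}$ members are lost at once, the service cannot make progress regardless of how quickly Kubernetes reschedules them.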

Performance

High performance for stateful services depends on both networking and storage. Kubernetes delegates pod networking to CNI plugins, which follow either an underlay or an overlay design; examples include Flannel (UDP, VXLAN, and host‑gw backends) and Tencent's TKE network modes (global route, VPC‑CNI, pod‑exclusive NIC). For storage, the PV/PVC model, StorageClass, and CSI plugins enable flexible provisioning of local disks, cloud disks, and network file systems. The PVC, PV, and StorageClass definitions below show how these objects fit together for a Tencent Cloud CBS disk.

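# PVC: a request for 100Gi from the "cbs" StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: codis-data           # hypothetical name; the original omits metadata
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: cbs
---
# PV: the cloud disk that backs the claim
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-codis-data        # hypothetical name; dynamically provisioned PVs get generated names
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  persistentVolumeReclaimPolicy: Delete
  qcloudCbs:
    cbsDiskId: disk-r1z73a3x
  storageClassName: cbs
  volumeMode: Filesystem
---
# StorageClass: tells the provisioner how to create disks on demand
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cbs                  # matches storageClassName in the PVC and PV above
parameters:
  type: cbs
provisioner: cloud.tencent.com/qcloud-cbs
reclaimPolicy: Delete
volumeBindingMode: Immediate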
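For the local‑disk case mentioned above, a common arrangement is a StorageClass without a dynamic provisioner plus statically created local PVs. The sketch below assumes a node named `node-1` with a disk mounted at `/mnt/disks/ssd1`; the class name, PV name, node name, and path are all hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd                            # hypothetical class name
provisioner: kubernetes.io/no-provisioner    # local volumes are created statically
# Delay binding until a pod is scheduled, so the scheduler can pick
# a node that actually holds a matching local volume.
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-1                      # hypothetical PV name
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain      # keep data when the claim is deleted
  storageClassName: local-ssd
  volumeMode: Filesystem
  local:
    path: /mnt/disks/ssd1                    # assumed mount point on the node
  nodeAffinity:                              # required for local volumes
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]                 # assumed node name
```

Local disks trade durability for latency: the pod is pinned to the node that holds its data, so the service's own replication has to cover node loss.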

Chaos Engineering

To validate the stability of containerized stateful services, the article recommends using chaos‑engineering tools such as Chaos Mesh to inject pod, network, and I/O failures, helping uncover bugs in operators and underlying Kubernetes components.
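A minimal sketch of two such experiments is shown below, assuming Chaos Mesh 2.x is installed and that the target pods carry the `etcd_cluster: etcd-test` label from the anti‑affinity example earlier in this article. The experiment names, namespaces, and fault parameters are hypothetical.

```yaml
# Kill one randomly chosen etcd pod to exercise pod self-healing and member recovery
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-etcd-pod        # hypothetical experiment name
  namespace: chaos-testing       # assumed namespace for chaos objects
spec:
  action: pod-kill
  mode: one                      # pick a single pod from the matching set
  selector:
    namespaces:
    - default                    # assumed namespace of the etcd pods
    labelSelectors:
      etcd_cluster: etcd-test
---
# Add latency to all etcd pods for one minute to observe election and timeout behavior
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay-etcd-network       # hypothetical experiment name
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
    - default
    labelSelectors:
      etcd_cluster: etcd-test
  delay:
    latency: "100ms"
    jitter: "10ms"
  duration: "60s"
```

Running such experiments while a client workload is active makes it possible to check whether the operator restores the cluster without data loss.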

Conclusion

The article summarizes that successful containerization of stateful services requires careful workload selection, extension via CRDs/operators, advanced scheduling, robust HA mechanisms, performance‑optimized networking and storage, and thorough chaos‑engineering testing to ensure reliability and competitiveness.
