
Overview of XDF Local Storage Service (xlss) Architecture, Components, and Disaster Recovery Workflow

This article introduces xlss, a high‑performance, highly‑available local storage solution for Kubernetes. It details the core components, application scenarios, custom scheduler design, and backup/recovery workflow, and provides code snippets and CRD examples for running resilient stateful workloads.

New Oriental Technology

1. xlss Introduction

xlss (XDF Local Storage Service) is a high‑performance, highly‑available local storage solution for Kubernetes, designed to address the limitations of native local PVs (localpv), such as static‑only provisioning, rigid node‑affinity constraints, and the risk of data loss when a node fails.

2. Application Scenarios

xlss combines the high availability of remote storage with the performance of local storage, suitable for high‑IO applications (e.g., Kafka), dynamic local storage management, encrypted data backup, and storage resource monitoring/alerting.

3. xlss in Kubernetes

4. Main Components

xlss consists of three core components:

1. xlss‑scheduler

A custom scheduler built on kube‑scheduler.

Stateless pods are scheduled exactly as kube‑scheduler would schedule them.

For stateful pods, it automatically detects whether the pod uses an xlss local PV and, when the original node is unhealthy, intervenes to remove the node‑affinity constraint.

2. xlss‑rescuer

Runs as a DaemonSet.

Executes backup jobs according to backup policies.

Monitors restore requests and performs data recovery.

Exposes metrics for backup, restore, and xlss‑used PV resources.

3. xlss‑localpv‑provisioner

Dynamically provisions local PVs.

Currently leverages the OpenEBS localpv hostpath engine.
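Since the provisioner currently builds on the OpenEBS localpv hostpath engine, the StorageClass it relies on looks like the stock OpenEBS hostpath example below (the class name and BasePath are the OpenEBS defaults; xlss may ship its own variant):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-hostpath
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: "hostpath"
      - name: BasePath
        value: "/var/openebs/local"
provisioner: openebs.io/local
# Delay binding until a pod is scheduled, so the PV is carved out
# on the node the scheduler actually picks.
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```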

5. Local Storage Disaster Recovery Workflow

5.1 Workflow Diagram

5.2 Detailed Steps

The process is divided into six stages:

1. Data Backup

Backup jobs are triggered by annotations on Pods, StatefulSets, or custom Operator resources; a backup switch in the policy enables or disables the process.
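As a sketch, with hypothetical annotation keys (the article does not name the actual xlss keys), a StatefulSet's pod template might opt in to backups like this:

```yaml
metadata:
  annotations:
    # Hypothetical key names -- the real xlss annotations may differ.
    xlss.backup/enabled: "true"       # the backup switch
    xlss.backup/schedule: "0 2 * * *" # cron-style backup policy
```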

2. Node Exception

When a node becomes unhealthy, alerts are sent, and operators assess the situation; if the node cannot be recovered quickly, stateful services require intervention.

3. Delete Terminating Pod

If a pod remains in the Terminating state beyond the eviction timeout, it must be force‑deleted to allow recreation, e.g.:

kubectl delete pod myapp-1 -n test --force --grace-period=0

4. Intelligent Scheduling

For stateful pods whose original node is detected as unhealthy, the scheduler applies a special mark to the pod, removes its node‑affinity constraint, and then sends a restore request.

5. Data Recovery

Upon a restore request, the latest backup snapshot is restored to a healthy node; recovery can occur with or without existing backup data.

6. Pod Startup

After PV creation and data preparation, the stateful pod starts its application services.

6. Key Technical Analysis

6.1 xlss‑scheduler Design

The scheduler extends the Kubernetes scheduling framework with custom plugins and three custom extension points.

6.1.1 Scheduling Framework

The framework provides a plugin architecture for customizing scheduling decisions.

6.1.2 Extension Points Overview

Key extension points include QueueSort, PreFilter, Filter, PreScore, Score, NormalizeScore, Reserve, Permit, PreBind, Bind, PostBind, and Unreserve.

6.1.3 Custom Extension Points

xlss‑scheduler implements three custom points:

PreFilter – marks stateful pods using xlss localpv when the original node is unhealthy.

Filter – removes node‑affinity for marked pods.

PreBind – deletes the mark and sends a data‑restore request.

PreFilter code snippet:

// Inside PreFilter: inspect the conditions of the pod's original node.
if condition.Type == v1.NodeReady && condition.Status != v1.ConditionTrue {
    // The node is not Ready: mark the pod so the later extension
    // points (Filter, PreBind) know to intervene.
    if pod.Annotations == nil {
        pod.Annotations = map[string]string{POD_MARK_KEY: POD_MARK_VALUE}
    } else {
        pod.Annotations[POD_MARK_KEY] = POD_MARK_VALUE
    }
    return nil
}
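To make the three extension points concrete, here is a dependency‑free toy in Go that mimics the mark → strip‑affinity → restore‑request sequence, using plain structs in place of the Kubernetes API types (the names and mark values are illustrative, not the actual xlss code):

```go
package main

import "fmt"

const (
	podMarkKey   = "xlss/mark" // illustrative; the real key is not published
	podMarkValue = "node-unhealthy"
)

type pod struct {
	Annotations  map[string]string
	NodeAffinity string // the node the local PV pins the pod to
}

// preFilter marks the pod when its original node is unhealthy.
func preFilter(p *pod, nodeReady bool) {
	if !nodeReady {
		if p.Annotations == nil {
			p.Annotations = map[string]string{}
		}
		p.Annotations[podMarkKey] = podMarkValue
	}
}

// filter strips the node affinity of marked pods so that any
// healthy node can pass filtering.
func filter(p *pod) {
	if p.Annotations[podMarkKey] == podMarkValue {
		p.NodeAffinity = ""
	}
}

// preBind clears the mark and returns a restore request targeting
// the node the pod is about to bind to.
func preBind(p *pod, target string) string {
	if p.Annotations[podMarkKey] == podMarkValue {
		delete(p.Annotations, podMarkKey)
		return "restore latest backup onto " + target
	}
	return ""
}

func main() {
	p := &pod{NodeAffinity: "172.24.248.14"}
	preFilter(p, false) // original node is NotReady
	filter(p)
	fmt.Println(p.NodeAffinity == "") // true: affinity removed
	fmt.Println(preBind(p, "172.24.248.15"))
}
```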

CRD example for a restore request:

apiVersion: localpv.k8s.arch.xdf.cn/v1
kind: PvRestore
metadata:
  name: kafka-kafka-openebs-0-6jn79-localpv-pr
  namespace: kafka
spec:
  nodeName: 172.24.248.14
  namespace: kafka
  podName: kafka-openebs-0-6jn79
  volumes:
    - pvc-f8910594-088e-4107-b6b1-8a1d58793bf5

6.2 Data Backup and Recovery Design

6.2.1 Backup Job Workflow

Backup jobs are managed via a Kubernetes informer that watches Pod changes and extracts custom annotations to schedule backups.
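A minimal sketch of the extraction step the informer performs: given a pod's annotations, pull out the backup‑related ones. The annotation prefix here is hypothetical, since the article does not publish the real key names:

```go
package main

import (
	"fmt"
	"strings"
)

// backupPrefix is a hypothetical annotation namespace; the real
// xlss key names may differ.
const backupPrefix = "xlss.backup/"

// extractBackupPolicy filters a pod's annotations down to the
// backup-related keys the rescuer would act on, stripping the prefix.
func extractBackupPolicy(annotations map[string]string) map[string]string {
	policy := map[string]string{}
	for k, v := range annotations {
		if strings.HasPrefix(k, backupPrefix) {
			policy[strings.TrimPrefix(k, backupPrefix)] = v
		}
	}
	return policy
}

func main() {
	ann := map[string]string{
		"xlss.backup/enabled":  "true",
		"xlss.backup/schedule": "0 2 * * *",
		"other/key":            "ignored",
	}
	fmt.Println(extractBackupPolicy(ann))
}
```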

6.2.2 Backup Strategy Switch

Both user‑defined and system‑default backup strategies exist, with user settings taking precedence; each strategy has an enable/disable switch.
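The precedence rule can be sketched as a small resolver: a user strategy, when present, overrides the system default, and a disabled switch means no backup runs (the type and field names are illustrative, not the actual xlss schema):

```go
package main

import "fmt"

// strategy is an illustrative stand-in for an xlss backup policy.
type strategy struct {
	Enabled  bool   // the enable/disable switch
	Schedule string // cron-style backup schedule
}

// resolve picks the effective strategy: user settings take
// precedence over the system default; a nil result means the
// backup switch is off.
func resolve(user, def *strategy) *strategy {
	s := def
	if user != nil {
		s = user
	}
	if s == nil || !s.Enabled {
		return nil
	}
	return s
}

func main() {
	def := &strategy{Enabled: true, Schedule: "0 3 * * *"}
	user := &strategy{Enabled: true, Schedule: "0 1 * * *"}
	fmt.Println(resolve(user, def).Schedule)                    // user schedule wins
	fmt.Println(resolve(&strategy{Enabled: false}, def) == nil) // switch off: no backup
}
```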

6.2.3 Recovery Job Workflow

Recovery jobs also rely on informers that watch CRD changes and trigger fast data restoration.
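The recovery path mirrors the backup path: an informer‑style add handler reacts to a new PvRestore object and enqueues a restore job for a worker. A dependency‑free sketch, using a channel as the work queue and struct fields that follow the PvRestore spec shown earlier:

```go
package main

import "fmt"

// pvRestore mirrors the spec fields of the PvRestore CRD.
type pvRestore struct {
	NodeName string
	PodName  string
	Volumes  []string
}

// onAdd is the informer-style add handler: it enqueues the restore
// request so a worker can perform the actual data restoration.
func onAdd(queue chan<- pvRestore, r pvRestore) {
	queue <- r
}

func main() {
	queue := make(chan pvRestore, 1)
	onAdd(queue, pvRestore{
		NodeName: "172.24.248.14",
		PodName:  "kafka-openebs-0-6jn79",
		Volumes:  []string{"pvc-f8910594-088e-4107-b6b1-8a1d58793bf5"},
	})
	r := <-queue
	fmt.Printf("restoring %d volume(s) to %s for pod %s\n",
		len(r.Volumes), r.NodeName, r.PodName)
}
```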

7. Conclusion

xlss follows cloud‑native design principles, with its three components—xlss‑scheduler, xlss‑rescuer, and xlss‑localpv‑provisioner—collaborating to provide a resilient local‑storage disaster‑recovery solution that reduces network latency, improves performance for stateful services, and expands high‑availability options in cloud‑native environments.

8. References

https://www.qikqiak.com/post/custom-kube-scheduler/

https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
