
Containerizing Elasticsearch and ClickHouse on Kubernetes: Architecture, Implementation, and Benefits

Bilibili migrated its Elasticsearch and ClickHouse clusters to Kubernetes using custom operators, CRDs, LVM‑based local storage, MacVLAN networking, and pod anti‑affinity. The result: higher resource utilization, stronger isolation, and automation that cut the server count, eliminated query latency spikes, and dramatically lowered operational costs.

Bilibili Tech

In the cloud‑native era, Kubernetes has become the de facto standard for container orchestration, enabling automated operations and resource scheduling for stateless services. However, stateful services such as Elasticsearch (ES) and ClickHouse (CK) introduce additional complexity, especially around data persistence, resource isolation, and high availability.

This article describes Bilibili's production‑grade experience of migrating ES and CK to Kubernetes. It first analyzes the existing on‑premises setup, highlighting issues such as low resource utilization, high operational cost, and the lack of multi‑tenant isolation, and then presents the containerized architecture, implementation details, and resulting benefits.

Current Situation

Multiple public ES clusters share resources, causing query latency spikes when any cluster reaches peak CPU or cache usage.

Independent bare‑metal clusters improve isolation but suffer from poor utilization (often <5%).

CK clusters face similar multi‑tenant and utilization problems.

To address these challenges, two technical paths were evaluated:

Develop a custom operations platform with topology awareness, resource scheduling, and orchestration capabilities.

Containerize ES/CK and let Kubernetes handle scheduling and orchestration.

The comparison table (shown in the original article) evaluates each solution on operational cost, isolation, development effort, and resource utilization. The container‑on‑K8s approach scores high on isolation and resource efficiency while keeping operational cost low.

Overall Architecture

The solution relies on three core Kubernetes concepts:

Operators that create StatefulSet resources, provision PVC volumes, and perform tuning.

Linux namespaces, cgroups, and rootfs to provide isolation.

MacVLAN CNI for high‑performance, host‑level networking.

Key components include a custom CRD for disk resources, a CSI agent that reports available LVM volume groups, a scheduler that matches pod requests to appropriate disks, and a storage class that provisions local PVs on demand.

Technical Details

1. Controllers

Kubernetes' built‑in controllers (Deployment, ReplicaSet, StatefulSet, Job/CronJob) are insufficient for ES/CK because they cannot express complex role‑based topologies (master, data, ingest, etc.). Custom resource definitions (CRDs) and operators are therefore used to encode the desired state and reconcile toward it.
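As a sketch of this pattern (using the open‑source ECK operator's CRD for illustration — Bilibili's internal CRDs may differ, and the cluster name, version, and node counts here are placeholders), a single custom resource can declare a role‑based topology that the operator reconciles into one StatefulSet per node set:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-test
spec:
  version: 7.17.0
  nodeSets:
  - name: master          # dedicated master-eligible nodes
    count: 3
    config:
      node.roles: ["master"]
  - name: data            # data + ingest nodes
    count: 6
    config:
      node.roles: ["data", "ingest"]
```

Each nodeSet carries its own pod template and volume claims, which is exactly the role‑aware expressiveness the built‑in controllers lack.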

2. Persistent Storage

High IOPS and low latency requirements lead to the use of local disks (LVM). Data is stored on logical volumes created from volume groups of the same media type (SSD/HDD/NVMe). The following snippet shows the LVM‑based storage class configuration:

```yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-sc
mountOptions:
- rw
parameters:
  csi.storage.io/disk-type: nvme
  csi.storage.k8s.io/fstype: xfs
provisioner: csi.storage.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

The WaitForFirstConsumer mode delays PVC‑PV binding until the pod is scheduled, preventing scheduling failures when using local PVs.

3. Disk Reporting

The CSI agent discovers raw disks, creates LVM physical volumes, groups them by media type, and reports capacity via a custom CRD. Example of reported capacity:

status:
  allocatable:
    csi.storage.io/csi-vg-hdd: "178789"
    csi.storage.io/csi-vg-ssd: "3102"
  capacity:
    csi.storage.io/csi-vg-hdd: "178824"
    csi.storage.io/csi-vg-ssd: "3112"

4. Disk Scheduling

The scheduler filters nodes based on the requested disk type and size, then scores them according to either a centralized or dispersed strategy. Users simply declare the desired storage class, size, and media type in a PVC; the scheduler and CSI controller handle provisioning and mounting.
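Concretely, the user‑facing request can be a plain PVC against the csi-sc storage class shown earlier; the claim name and size below are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-es-test-0       # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-sc   # selects the NVMe-backed LVM class
  resources:
    requests:
      storage: 500Gi         # checked against the reported VG capacity
```

Because the class uses WaitForFirstConsumer, this claim stays pending until the pod lands on a node whose volume group can satisfy the request, at which point the CSI controller carves out a logical volume and binds it.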

5. Network

MacVLAN provides each pod with a unique L2 IP address, enabling direct host‑level communication without kube‑proxy overhead. This is essential for ES/CK internal traffic such as shard rebalance and master election.
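One way such a network can be wired up — a sketch assuming Multus is used to attach the macvlan CNI plugin, since the article does not name the exact mechanism; the interface name and subnet are placeholders:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "10.0.0.0/16",
        "gateway": "10.0.0.1"
      }
    }
```

Each pod attached to this network gets its own MAC and L2 IP on the host's physical segment, so ES transport traffic and CK replication bypass the overlay entirely.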

6. Service Discovery

Headless services combined with CoreDNS allow pods to discover each other via DNS. A query gateway abstracts changing pod IPs for external clients, handling read/write routing and keeping a cached address list.
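The discovery mechanism boils down to a headless Service (clusterIP: None), which makes CoreDNS resolve to the pod IPs directly; the names and label below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: es-test-transport
spec:
  clusterIP: None   # headless: DNS returns individual pod IPs
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: es-test
  ports:
  - name: transport
    port: 9300
```

Each pod is then resolvable as <pod-name>.es-test-transport.<namespace>.svc, giving cluster members stable DNS names even as pod IPs change.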

7. High Availability

Pod anti‑affinity rules ensure that replicas of the same role are not co‑located on the same host, preserving both cluster‑level and data‑level HA. Example affinity rule:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          elasticsearch.k8s.elastic.co/cluster-name: es-test
          elasticsearch.k8s.elastic.co/node-data: 'true'
      topologyKey: kubernetes.io/hostname

CK uses similar shard‑level anti‑affinity.

8. Memory & IO Isolation

Cgroups enforce memory limits, and those limits cover page cache as well as process memory. The following kernel snippet illustrates how page‑in/page‑out events are accounted per memory cgroup:

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
{
    /* pagein of a big page is an event. So, ignore page size */
    if (nr_pages > 0)
        __count_memcg_events(memcg, PGPGIN, 1);
    else {
        __count_memcg_events(memcg, PGPGOUT, 1);
        nr_pages = -nr_pages; /* for event */
    }
    __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
}

Thus, both JVM heap and page cache must be sized appropriately for ES containers.
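A common sizing convention (an assumption here, not prescribed by the article) is to cap the JVM heap at roughly half the container memory limit, leaving the remainder for the page cache that Lucene reads depend on:

```yaml
containers:
- name: elasticsearch
  env:
  - name: ES_JAVA_OPTS
    value: "-Xms16g -Xmx16g"  # heap pinned to half of the limit
  resources:
    requests:
      memory: 32Gi
    limits:
      memory: 32Gi            # cgroup limit also covers page cache
```

If the heap were sized near the full limit, page‑cache charging would push the cgroup over its limit and trigger reclaim or OOM kills under read‑heavy load.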

Implementation Steps

Initialize the Kubernetes cluster via the internal PaaS team.

Install the CRDs: kubectl create -f crds.yaml

Deploy the operator: kubectl apply -f operator-bili.yaml

Verify the operator pod is READY: kubectl get pod -n elastic-system

Create an ES cluster: kubectl apply -f elasticsearch.yaml

Check cluster health: kubectl get elasticsearch -n elastic (expect green status).

Retrieve pod IPs (MacVLAN makes them reachable from outside) and the generated credentials from the secret.

CK deployment follows a similar flow using the Altinity ClickHouse operator and a custom keeper operator.
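For reference, a minimal ClickHouseInstallation for the Altinity operator looks like the following (cluster name and layout are illustrative, and Bilibili's keeper operator is separate from this resource):

```yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: ck-test
spec:
  configuration:
    clusters:
    - name: main
      layout:
        shardsCount: 2    # two shards...
        replicasCount: 2  # ...each with two replicas (4 pods total)
```

The operator expands the layout into StatefulSets and generates the matching remote_servers configuration for distributed queries.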

Observability

Metrics are exported via ES exporter and ClickHouse exporter, aggregated by Prometheus, and visualized in Grafana dashboards. The exporter has been extended to auto‑discover all on‑K8s clusters, supporting dynamic scaling.
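Auto‑discovery of exporter pods can be expressed with Prometheus's Kubernetes service discovery; a sketch, assuming exporter pods carry a hypothetical app: elasticsearch-exporter label:

```yaml
scrape_configs:
- job_name: es-exporters
  kubernetes_sd_configs:
  - role: pod               # enumerate pods via the K8s API
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: elasticsearch-exporter
    action: keep            # scrape only exporter pods, wherever scheduled
```

New clusters become visible to monitoring as soon as their exporter pods start, with no static target list to maintain.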

Logs are currently exposed via the Kubernetes API; future work includes integration with the company‑wide logging platform.

Productization

The solution has been packaged as a self‑service product: users fill in a simple form, and the platform automatically creates, scales, or deletes clusters within minutes, binding CMDB entries, monitoring, and change‑control systems. This reduces the manual effort from half a day per cluster to under two minutes.

Benefits

Cost: Migration saved >100 bare‑metal servers and raised CPU utilization from 5% to 15%.

Quality: Isolation eliminates cross‑tenant interference; query latency spikes are resolved.

Efficiency: Automated provisioning, rolling updates, and self‑healing operators enable near‑zero‑maintenance operation.

Conclusion

Running stateful services on Kubernetes demands deep knowledge of containers, networking, storage, and the underlying OS. The Bilibili experience demonstrates that with proper operators, custom resources, and automation, the trade‑offs become manageable, delivering significant cost savings, reliability, and operational efficiency.
