Containerizing Elasticsearch & ClickHouse on Kubernetes: Bilibili’s Scalable, Low‑Cost Solution
This article details Bilibili’s journey of containerizing Elasticsearch and ClickHouse on Kubernetes, covering the challenges of stateful services, architectural decisions, custom operators, storage and network solutions, deployment steps, observability enhancements, and the resulting cost, quality, and efficiency gains.
Introduction
In the cloud‑native era, Kubernetes has become the de‑facto standard for container orchestration, making automation and resource management possible for stateless services. However, stateful services such as Elasticsearch and ClickHouse introduce complex dependencies (master‑slave roles, local‑disk data) that dramatically increase deployment difficulty. This article explains why Bilibili decided to containerize these services on Kubernetes, the challenges encountered, and the concrete technical solutions.
Current Situation and Requirements
Bilibili runs multiple public Elasticsearch clusters serving both B2B and B2C workloads. Initially the clusters were stable, but as traffic grew the shared‑cluster model caused query latency spikes, cache eviction conflicts, and poor resource isolation. Scaling out with independent bare‑metal clusters solved latency but introduced severe under‑utilization (often <5% CPU) and high operational cost.
ClickHouse faced similar multi‑tenant and resource‑utilization problems across more than 30 clusters and 500 physical nodes.
Key requirements identified were:
Strong resource isolation to prevent cross‑service interference
Efficient cluster scheduling to maximize utilization while preserving high availability
Low‑maintenance automation (ideally zero‑ops)
Minimal development effort and reuse of existing bare‑metal performance
Two technical paths were evaluated:
Build a custom operations platform with topology‑aware scheduling
Run Elasticsearch/ClickHouse inside containers and let Kubernetes handle scheduling
Solution Comparison
The five approaches compare as follows:
Public cluster – medium ops cost, weak isolation, low development cost, medium resource utilization.
Independent cluster – medium ops cost, strong isolation, low development cost, low utilization.
Operations platform – medium ops cost, strong isolation, high development cost, high utilization.
Component re‑work – medium ops cost, medium isolation, high development cost, medium utilization.
On K8s – low ops cost, strong isolation, medium development cost, high utilization.
Overall Architecture
The architecture relies on three core Linux primitives (namespaces, cgroups, and rootfs) to build isolated environments. The overall flow is:
The Operator creates StatefulSet objects, PVCs, and performs custom reconciliation.
Each node reports CPU, memory, and disk availability to the scheduler.
The scheduler places Pods on nodes that satisfy the requested storage class and resource constraints.
CoreDNS provides service discovery, while Macvlan supplies L2 networking for cross‑node pod communication.
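The placement step in the flow above can be sketched as a filter over node-reported capacity. This is a minimal illustration, not the scheduler Bilibili runs: the `Node` and `PodRequest` types, the GiB units, and the "most free space wins" scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free: float      # free CPU cores
    mem_free_gib: float  # free memory
    vg_free: dict        # volume group name -> free space (units assumed to be GiB)


@dataclass
class PodRequest:
    cpu: float
    mem_gib: float
    vg: str              # volume group implied by the requested storage class
    disk: int


def fits(node, pod):
    """Filter step: does the node satisfy the CPU, memory, and storage requests?"""
    return (
        node.cpu_free >= pod.cpu
        and node.mem_free_gib >= pod.mem_gib
        and node.vg_free.get(pod.vg, 0) >= pod.disk
    )


def place(nodes, pod):
    """Pick a fitting node and deduct the request; return its name, or None."""
    candidates = [n for n in nodes if fits(n, pod)]
    if not candidates:
        return None
    # Scoring choice (an assumption): prefer the node with the most free space.
    best = max(candidates, key=lambda n: n.vg_free[pod.vg])
    best.cpu_free -= pod.cpu
    best.mem_free_gib -= pod.mem_gib
    best.vg_free[pod.vg] -= pod.disk
    return best.name


nodes = [
    Node("node-a", 8, 32, {"csi-vg-ssd": 100}),
    Node("node-b", 8, 32, {"csi-vg-ssd": 500}),
]
print(place(nodes, PodRequest(cpu=4, mem_gib=16, vg="csi-vg-ssd", disk=200)))
```

A real scheduler also weighs affinity rules and many other signals; the sketch only shows why each node must report its free volume-group space alongside CPU and memory.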
Technical Details
3.1 Controllers
The default Kubernetes controllers (Deployment, ReplicaSet, StatefulSet, Job/CronJob) are insufficient for Elasticsearch because node roles (master, data, ingest) cannot be expressed by a single StatefulSet. Custom resource definitions (CRDs) and a bespoke Operator were introduced to encode role-specific logic and to reconcile the observed state toward the desired state.
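The reconciliation idea can be sketched in a few lines. The snippet below is a simplified model, assuming one StatefulSet per role; the `NodeSet` type and the action names are hypothetical and are not taken from Bilibili's operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSet:
    role: str      # for example "master", "data", or "ingest"
    replicas: int


def reconcile(desired, observed):
    """Return the actions that move observed replica counts toward the desired ones.

    desired:  list of NodeSet taken from the custom resource spec
    observed: dict mapping role -> replicas currently running
    """
    actions = []
    want = {ns.role: ns.replicas for ns in desired}
    for role, replicas in want.items():
        have = observed.get(role, 0)
        if role not in observed and replicas > 0:
            actions.append(("create_statefulset", role, replicas))
        elif have != replicas:
            actions.append(("scale_statefulset", role, replicas))
    for role in observed:
        if role not in want:
            actions.append(("delete_statefulset", role, 0))
    return actions


desired = [NodeSet("master", 3), NodeSet("data", 6)]
observed = {"master": 3, "data": 4, "ingest": 2}
for action in reconcile(desired, observed):
    print(action)
```

An operator runs this kind of comparison repeatedly, so a failed or partial action is simply retried on the next pass.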
3.2 Persistent Storage
High IOPS and low latency requirements led to the use of local PVs backed by LVM. Data loss on host failure is mitigated by configuring Elasticsearch replicas and ClickHouse ReplicatedMergeTree tables. LVM groups disks of the same medium (SSD/HDD) into a volume group, then creates logical volumes on demand.
status:
  allocatable:
    csi.storage.io/csi-vg-hdd: "178789"
    csi.storage.io/csi-vg-ssd: "3102"
  capacity:
    csi.storage.io/csi-vg-hdd: "178824"
    csi.storage.io/csi-vg-ssd: "3112"
3.3 Disk Reporting
The csi‑agent discovers raw disks, creates physical volumes, groups them by medium, and reports the VG sizes to the scheduler via a custom CRD.
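As a rough sketch of the reporting step, the snippet below groups discovered disks by medium and builds a status fragment with the same key shape as the one shown earlier. The disk list, the size units, and the rule that allocatable equals capacity minus already-allocated space are assumptions made for illustration; the real csi-agent logic is not described in the source.

```python
from collections import defaultdict

KEY_PREFIX = "csi.storage.io/"


def group_by_medium(disks):
    """disks: iterable of (device, medium, size). Returns {vg_name: total size}."""
    totals = defaultdict(int)
    for _device, medium, size in disks:
        totals[f"csi-vg-{medium}"] += size
    return dict(totals)


def build_status(capacity, allocated):
    """Build a capacity/allocatable fragment; values are strings, as in the example."""
    status = {"allocatable": {}, "capacity": {}}
    for vg, total in capacity.items():
        free = total - allocated.get(vg, 0)  # assumed definition of allocatable
        status["capacity"][KEY_PREFIX + vg] = str(total)
        status["allocatable"][KEY_PREFIX + vg] = str(free)
    return status


# Hypothetical disks found on one host.
disks = [("/dev/sda", "hdd", 100), ("/dev/sdb", "hdd", 100), ("/dev/nvme0n1", "ssd", 50)]
capacity = group_by_medium(disks)
print(build_status(capacity, allocated={"csi-vg-ssd": 10}))
```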
3.4 Disk Scheduling
When a PVC is created, the scheduler selects a node whose VG has sufficient free space of the requested type. The PVC’s StorageClass defines the provisioning behavior. Example affinity rule to keep data‑node pods on different hosts:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          elasticsearch.k8s.elastic.co/cluster-name: es-test
          elasticsearch.k8s.elastic.co/node-data: "true"
      topologyKey: kubernetes.io/hostname
3.5 Memory Isolation
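A memory cgroup charges both RSS and page cache against the container's limit. When memory.usage_in_bytes reaches memory.limit_in_bytes, the kernel reclaims memory inside the cgroup, which in practice means evicting page cache. Because Elasticsearch depends heavily on the page cache for query performance, the container memory limit must be sized together with the JVM heap.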
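The sizing trade-off can be made concrete with a small calculation. The sketch below parses text in the cgroup v1 memory.stat layout (one "name value" pair per line, including `cache` and `rss`) and estimates how much of a container limit remains for page cache once the JVM heap is set aside. The sample numbers and the `overhead_bytes` allowance are hypothetical.

```python
GIB = 1024 ** 3


def parse_memory_stat(text):
    """Parse 'name value' lines, as found in a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def page_cache_headroom(limit_bytes, heap_bytes, overhead_bytes=0):
    """Bytes left for page cache after the JVM heap and other process memory."""
    return max(limit_bytes - heap_bytes - overhead_bytes, 0)


# Hypothetical sample; real values come from the container's cgroup directory.
sample = "cache 21474836480\nrss 34359738368\n"
stats = parse_memory_stat(sample)
print("page cache in use (GiB):", stats["cache"] / GIB)
print("headroom (GiB):", page_cache_headroom(64 * GIB, 31 * GIB, 2 * GIB) / GIB)
```

If the headroom is small relative to the hot index data, queries will see more disk reads even though the JVM itself looks healthy.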
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
{
    /* A charge counts as one page-in event, an uncharge as one page-out event. */
    if (nr_pages > 0)
        __count_memcg_events(memcg, PGPGIN, 1);
    else {
        __count_memcg_events(memcg, PGPGOUT, 1);
        nr_pages = -nr_pages; /* use the absolute value for the counter below */
    }
    __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
}
3.6 Container Network
Macvlan was chosen for its L2 performance and direct external accessibility. Compared with overlay and host networking, Macvlan provides independent pod IPs and avoids kube‑proxy overhead, at the cost of IP pool management.
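The IP pool management cost mentioned above comes down to tracking which addresses in a subnet are leased to which pod. The following is a toy allocator, assuming a single IPv4 subnet and in-memory state; the `IPPool` class and the example subnet are hypothetical and do not describe Bilibili's IPAM.

```python
import ipaddress


class IPPool:
    """Hand out addresses from one subnet and remember which pod holds which."""

    def __init__(self, cidr, reserved=()):
        skip = set(reserved)
        network = ipaddress.ip_network(cidr)
        self._free = [ip for ip in network.hosts() if str(ip) not in skip]
        self._leases = {}

    def allocate(self, pod):
        if pod in self._leases:          # idempotent for retried requests
            return self._leases[pod]
        if not self._free:
            raise RuntimeError("IP pool exhausted")
        ip = str(self._free.pop(0))
        self._leases[pod] = ip
        return ip

    def release(self, pod):
        ip = self._leases.pop(pod, None)
        if ip is not None:
            self._free.append(ipaddress.ip_address(ip))


pool = IPPool("10.0.0.0/29", reserved=("10.0.0.1",))  # .1 assumed to be the gateway
print(pool.allocate("es-data-0"), pool.allocate("es-data-1"))
```

A production allocator would persist leases and coordinate across nodes; the sketch only shows the bookkeeping that overlay networks hide from the operator.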
3.7 Service Discovery
Headless services combined with CoreDNS enable intra‑cluster name resolution. A unified query gateway proxies all client requests, dynamically updates the endpoint list, and performs read/write separation for Elasticsearch and ClickHouse.
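To illustrate both halves of this design, the sketch below builds the per-pod DNS names that a headless service exposes for a StatefulSet, and classifies Elasticsearch-style HTTP requests as reads or writes. The DNS pattern is standard Kubernetes behavior with the default cluster.local domain; the service names and the classification rule are assumptions, since the source does not say how the gateway separates traffic.

```python
READ_ENDPOINTS = ("_search", "_msearch", "_count", "_mget")


def pod_dns_names(statefulset, service, namespace, replicas, cluster_domain="cluster.local"):
    """DNS names of StatefulSet pods behind a headless service."""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


def classify(method, path):
    """Hypothetical rule: GET/HEAD and search-style POSTs are reads, the rest writes."""
    last_segment = path.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    method = method.upper()
    if method in ("GET", "HEAD"):
        return "read"
    if method == "POST" and last_segment in READ_ENDPOINTS:
        return "read"
    return "write"


print(pod_dns_names("es-data", "es-headless", "elastic", 2))
print(classify("POST", "/logs/_search?size=10"), classify("POST", "/logs/_bulk"))
```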
3.8 High Availability
Both cluster‑level HA (preventing split‑brain) and data‑level HA (replicas) are achieved via Kubernetes anti‑affinity rules, ensuring that pods of the same role are not co‑located on the same host.
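A placement can be checked against this rule after the fact. The helper below is a small sketch that reports any host carrying more than one pod of the same role; the pod and host names are hypothetical.

```python
from collections import defaultdict


def co_location_violations(placements):
    """placements: iterable of (pod, role, host).

    Returns {(role, host): [pods]} for every host that runs more than one
    pod of the same role, which is what the anti-affinity rule forbids.
    """
    groups = defaultdict(list)
    for pod, role, host in placements:
        groups[(role, host)].append(pod)
    return {key: pods for key, pods in groups.items() if len(pods) > 1}


placements = [
    ("es-master-0", "master", "host-1"),
    ("es-master-1", "master", "host-2"),
    ("es-data-0", "data", "host-1"),
    ("es-data-1", "data", "host-1"),
]
print(co_location_violations(placements))
```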
3.9 Implementation Steps
Install CRDs: kubectl create -f crds.yaml
Deploy the operator: kubectl apply -f operator-bili.yaml (image reference: {{registry}}/cloud-on-k8s-bili:2.3.2)
Verify operator readiness: kubectl get pod -n elastic-system
Create an Elasticsearch cluster: kubectl apply -f elasticsearch.yaml
Check cluster health: kubectl get elasticsearch -n elastic (expect green status)
Retrieve external IPs: kubectl get pod -n elastic -o wide
Retrieve credentials: kubectl get secret test-es-elastic-user -o go-template='{{.data.elastic | base64decode}}' -n elastic
Observability
Metrics are exported by a custom es-exporter and ck-exporter running inside the cluster, scraped by Prometheus, and visualized in Grafana dashboards (cluster health, JVM, CPU, thread pools, QPS, etc.). Container-level page-cache metrics are collected by cAdvisor.
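What such an exporter does can be sketched as a translation from the Elasticsearch cluster health response into the Prometheus text exposition format. The field names below (status, number_of_nodes, active_shards, unassigned_shards) are part of the standard cluster health API; the metric names and the numeric status encoding are assumptions and are not taken from Bilibili's es-exporter.

```python
STATUS_VALUE = {"green": 0, "yellow": 1, "red": 2}  # assumed encoding


def render_metrics(cluster, health):
    """Render a cluster health dict as Prometheus text-format lines."""
    labels = f'cluster="{cluster}"'
    lines = [f'es_cluster_status{{{labels}}} {STATUS_VALUE[health["status"]]}']
    for key in ("number_of_nodes", "active_shards", "unassigned_shards"):
        lines.append(f'es_cluster_{key}{{{labels}}} {health[key]}')
    return "\n".join(lines) + "\n"


# Hypothetical health payload; a real exporter would fetch it over HTTP.
health = {"status": "green", "number_of_nodes": 3, "active_shards": 10, "unassigned_shards": 0}
print(render_metrics("es-test", health), end="")
```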
Productization
The solution was wrapped as a private-cloud service: users fill in a simple form, and the platform automatically creates, scales, or deletes clusters within minutes, integrates with the CMDB, monitoring, and change-control systems, and provides one-click integration for downstream services.
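One plausible way to turn a submitted form into a cluster is to render it into an Elasticsearch custom resource and hand that to the operator. The sketch below assumes the Bilibili operator keeps the upstream ECK schema (apiVersion elasticsearch.k8s.elastic.co/v1, spec.nodeSets); the form field names, the version string, and the storage class name are hypothetical.

```python
import json


def build_manifest(form):
    """Translate a simple form dict into an Elasticsearch custom resource."""
    node_sets = []
    for ns in form["node_sets"]:
        node_sets.append({
            "name": ns["name"],
            "count": ns["count"],
            "config": {"node.roles": ns["roles"]},
            "volumeClaimTemplates": [{
                "metadata": {"name": "elasticsearch-data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": ns["storage_class"],
                    "resources": {"requests": {"storage": ns["storage"]}},
                },
            }],
        })
    return {
        "apiVersion": "elasticsearch.k8s.elastic.co/v1",
        "kind": "Elasticsearch",
        "metadata": {"name": form["cluster_name"], "namespace": form["namespace"]},
        "spec": {"version": form["version"], "nodeSets": node_sets},
    }


form = {
    "cluster_name": "es-test",
    "namespace": "elastic",
    "version": "7.17.0",  # hypothetical version
    "node_sets": [
        {"name": "master", "count": 3, "roles": ["master"],
         "storage_class": "csi-lvm-ssd", "storage": "50Gi"},  # hypothetical class
        {"name": "data", "count": 6, "roles": ["data"],
         "storage_class": "csi-lvm-ssd", "storage": "500Gi"},
    ],
}
print(json.dumps(build_manifest(form), indent=2))
```

The JSON output can be applied with kubectl in the same way as a YAML manifest, which keeps the platform's form handler free of any direct cluster logic.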
Benefits
Cost: saved more than 100 bare-metal servers; CPU utilization rose from ~5% to ~15%.
Quality: isolation eliminated query timeouts in high-traffic scenarios, improving user experience.
Efficiency: provisioning time dropped from half a day of manual effort to 1–2 minutes; many failure scenarios are now recovered automatically by the operator, achieving near zero-ops.
Conclusion
Adopting cloud‑native technologies requires cross‑domain expertise; database engineers must understand Kubernetes and vice‑versa. Bilibili’s experience shows that with a well‑designed operator, LVM‑backed local storage, and macvlan networking, stateful services can be safely migrated to Kubernetes, delivering significant cost savings and operational agility.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
