Containerizing Elasticsearch & ClickHouse on Kubernetes: Challenges & Solutions
Facing the complexities of running stateful services like Elasticsearch and ClickHouse in production, Bilibili’s infrastructure team detailed their migration to Kubernetes, describing the architectural design, custom operators, storage provisioning with LVM, network configuration, high‑availability strategies, observability, and the resulting cost, quality, and efficiency gains.
Background
In the cloud‑native era, Kubernetes has become the de‑facto standard for container orchestration, making automated operations and scheduling possible for stateless services. However, stateful services such as Elasticsearch and ClickHouse are far harder to orchestrate because of inter‑instance dependencies, local storage, and high‑availability requirements.
Current Situation
Bilibili operates multiple public Elasticsearch clusters serving 2B/2C online services, internal audit, risk control, etc. Initially the clusters were stable, but as traffic grew the shared‑cluster model caused query latency spikes, cache eviction issues, and unpredictable performance across tenants.
Requirements
Strong resource isolation to prevent cross‑tenant interference.
Cluster resource orchestration that maximizes utilization while guaranteeing high availability.
Low‑maintenance or even zero‑maintenance operation.
Development cost comparable to bare‑metal deployments without performance degradation.
Technical Paths Considered
Build a self‑developed operation platform that integrates topology, resource scheduling, and orchestration.
Run Elasticsearch and ClickHouse inside containers and let Kubernetes handle scheduling and orchestration.
Overall Architecture
Key Components and Design
1. Controllers
Kubernetes controllers reconcile the actual state of resources with the desired state defined in declarative APIs. The default controllers include Deployment (stateless), ReplicaSet, StatefulSet (stateful workloads), Job/CronJob, etc. For Elasticsearch the built‑in StatefulSet is insufficient: ES nodes take on multiple roles (master, data, ingest), and a StatefulSet's ordinal pod naming fixes the topology once pods are created, which makes role‑aware orchestration awkward.
Therefore custom resources (CRDs) and operators were introduced. The operator watches custom resources and performs the necessary actions to create pods, PVCs, and configure the cluster.
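As an illustration only (the API group, kind, and fields here are assumptions, not Bilibili's actual CRD schema), a custom Elasticsearch resource consumed by such an operator might declare roles and storage like this:

```yaml
# Hypothetical custom resource; group/version and field names are illustrative.
apiVersion: es.bilibili.example.com/v1
kind: Elasticsearch
metadata:
  name: es-test
spec:
  version: "7.10.2"
  nodeSets:
  - name: master
    count: 3
    roles: ["master"]
  - name: data
    count: 6
    roles: ["data", "ingest"]
    storage:
      storageClassName: csi-vg-ssd   # assumed local-PV StorageClass
      size: 500Gi
```

The operator watches resources of this kind and derives pods, PVCs, and Elasticsearch configuration from the spec.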
2. Persistent Storage
High IOPS and low latency are required, so local disks (local PV) are used. Elasticsearch indices are replicated, and ClickHouse tables use ReplicatedMergeTree to ensure data availability.
Because the local disks are raw block devices, LVM (Logical Volume Manager) is employed to aggregate multiple physical volumes into volume groups, create logical volumes on demand, and support dynamic expansion.
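As a sketch of the underlying LVM operations the CSI driver effectively performs (device names, volume group names, and sizes are placeholders):

```shell
# Aggregate two physical disks into one volume group
pvcreate /dev/sdb /dev/sdc
vgcreate csi-vg-ssd /dev/sdb /dev/sdc

# Carve out a logical volume for a PVC and put a filesystem on it
lvcreate -n pvc-data-0 -L 500G csi-vg-ssd
mkfs.ext4 /dev/csi-vg-ssd/pvc-data-0

# Dynamic expansion: grow the LV, then grow the filesystem online
lvextend -L +100G /dev/csi-vg-ssd/pvc-data-0
resize2fs /dev/csi-vg-ssd/pvc-data-0
```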
3. Disk Reporting
The CSI agent reports the size and type of each volume group on a node. The scheduler uses this information to place pods on nodes with sufficient resources.
status:
  allocatable:
    csi.storage.io/csi-vg-hdd: "178789"
    csi.storage.io/csi-vg-ssd: "3102"
  capacity:
    csi.storage.io/csi-vg-hdd: "178824"
    csi.storage.io/csi-vg-ssd: "3112"

4. Disk Scheduling
The scheduler evaluates each node’s free capacity per disk type and applies either a centralized or distributed strategy. Pods request PVCs with a specific StorageClass; the scheduler binds the PVC to a suitable node and creates the logical volume via the CSI controller.
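A StorageClass and PVC of roughly this shape (the class and provisioner names are assumptions) are what trigger the bind‑and‑provision flow:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-vg-ssd              # assumed name matching the reported VG type
provisioner: csi.storage.io     # assumed internal CSI driver name
volumeBindingMode: WaitForFirstConsumer  # bind only after the pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: es-data-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: csi-vg-ssd
  resources:
    requests:
      storage: 500Gi
```

`WaitForFirstConsumer` is the usual choice for local PVs: it delays volume binding until the scheduler has picked a node, so the logical volume is created on the node the pod actually lands on.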
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          elasticsearch.k8s.elastic.co/cluster-name: es-test
          elasticsearch.k8s.elastic.co/node-data: "true"
      topologyKey: kubernetes.io/hostname

5. Network
Pods need to communicate across hosts. Bilibili uses macvlan CNI, giving each pod a unique IP reachable from inside and outside the cluster, avoiding kube‑proxy overhead.
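A minimal macvlan CNI configuration (the master interface, subnet, and IPAM settings are illustrative) looks like:

```json
{
  "cniVersion": "0.3.1",
  "name": "macvlan-net",
  "type": "macvlan",
  "master": "eth0",
  "mode": "bridge",
  "ipam": {
    "type": "host-local",
    "subnet": "10.20.0.0/16",
    "gateway": "10.20.0.1"
  }
}
```

Each pod gets its own MAC and IP on the underlay network, so traffic bypasses kube‑proxy and NAT entirely.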
6. Service Discovery
Headless services combined with CoreDNS provide stable DNS names for intra‑cluster communication (e.g., shard rebalancing). A unified query gateway abstracts changing pod IPs, caching the current addresses for fast read/write routing.
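A headless Service (`clusterIP: None`) gives each pod a stable DNS record; a sketch for the ES transport layer (name and selector labels assumed):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: es-test-nodes
spec:
  clusterIP: None              # headless: DNS resolves directly to pod IPs
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: es-test
  ports:
  - name: transport
    port: 9300
```

CoreDNS then serves per‑pod records of the form `<pod-name>.es-test-nodes.<namespace>.svc.cluster.local`, which survive pod IP changes.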
7. High Availability
Both cluster‑level HA (preventing split‑brain) and data‑level HA (replicas) are achieved using pod anti‑affinity on the host name, ensuring that replicas of the same role are not placed on the same physical machine.
podDistribution:
- type: ShardAntiAffinity
  topologyKey: "kubernetes.io/hostname"

8. Memory & I/O Isolation
Cgroups limit both RSS and page cache. Because Elasticsearch leans heavily on both the JVM heap and the OS page cache, both must be accounted for in a pod's memory limit; the kernel reclaims page cache as a cgroup approaches that limit.
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
{
        /* A positive count is a page-in (charge); negative is a page-out (uncharge). */
        if (nr_pages > 0)
                __count_memcg_events(memcg, PGPGIN, 1);
        else {
                __count_memcg_events(memcg, PGPGOUT, 1);
                nr_pages = -nr_pages;
        }
        /* Accumulate per-CPU page-event counts for this memory cgroup. */
        __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
}

9. Operator Workflow
Typical steps to create an Elasticsearch cluster:
Initialize the Kubernetes cluster (provided by the PaaS team).
Apply the CRDs: kubectl create -f crds.yaml
Deploy the operator: kubectl apply -f operator-bili.yaml (the image reference is replaced with the internal registry).
Verify the operator pod is READY.
Create the Elasticsearch custom resource: kubectl apply -f elasticsearch.yaml
Check cluster health: kubectl get elasticsearch -n elastic (expect green status).
Retrieve external IPs (macvlan provides direct access) and credentials from the generated Secret.
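The steps above, condensed into a shell session. File names come from the list; the namespaces, the Secret name, and the `_cluster/health` check are assumptions following common Elasticsearch‑operator conventions:

```shell
kubectl create -f crds.yaml
kubectl apply -f operator-bili.yaml
kubectl get pods -n elastic-operator        # wait until the operator is READY

kubectl apply -f elasticsearch.yaml
kubectl get elasticsearch -n elastic        # expect HEALTH: green

# With macvlan, pod IPs are directly reachable; credentials live in a
# generated Secret (name assumed here for illustration)
PASSWORD=$(kubectl get secret es-test-es-elastic-user -n elastic \
  -o go-template='{{.data.elastic | base64decode}}')
curl -k -u "elastic:${PASSWORD}" "https://<pod-ip>:9200/_cluster/health"
```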
Observability
Metrics are exported via custom exporters that automatically discover all ES/CK clusters through the API server. Prometheus scrapes these exporters, and Grafana dashboards visualize key indicators such as cluster health, shard status, JVM heap, GC, CPU load, query QPS, and write throughput.
For ClickHouse, metrics include select/insert/alter counters and custom tables. Container‑level metrics (e.g., page cache) are collected by cAdvisor.
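The scrape side can be a simple discovery‑based job; a sketch (job names, pod labels, and ports are assumptions):

```yaml
scrape_configs:
- job_name: elasticsearch-exporter
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: es-exporter             # assumed exporter pod label
    action: keep
- job_name: clickhouse-exporter
  static_configs:
  - targets: ["ck-exporter:9116"]  # assumed exporter address/port
```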
Productization
The solution was packaged as a private‑cloud service. Users submit a form with cluster name and specifications; the platform automatically creates, scales, or destroys clusters within a minute. Integration with CMDB, monitoring, and change‑control systems provides end‑to‑end visibility.
Benefits
Cost: Migrating >30 ES/CK clusters to Kubernetes saved over 100 bare‑metal servers and increased CPU utilization from ~5 % to ~15 %.
Quality: Isolation eliminated cross‑tenant interference; query latency spikes disappeared, improving user experience.
Efficiency: Cluster provisioning time dropped from half‑day manual effort to 1–2 minutes, and many failure scenarios are now self‑healed by the operator, reducing operational overhead.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
