
How Starbucks China Revamped Its Log Platform: From VMs to Cloud‑Native Kubernetes with 80% Faster Queries

Starbucks China’s logging team migrated several petabytes of logs from legacy VM‑based Elasticsearch clusters to a cloud‑native bare‑metal Kubernetes platform, upgrading ES from 7.x to 8.x, containerizing components, optimizing storage and Kafka, and achieving up to 80% query speed gains, 30% CPU reduction, and 200% write‑throughput improvement.

dbaplus Community

Background

Starbucks China's log platform, in service since 2017, suffered from slow queries, high storage costs, and low availability. By 2024 the system handled multiple petabytes of data across several Elasticsearch clusters (7.8/7.9) deployed on virtual machines, with Logstash producers/consumers, Kafka as the buffer layer, and Filebeat for log collection.

Goals and Challenges

The team planned a migration from September 2024 to June 2025 to upgrade all components to Elasticsearch 8.x, move to a cloud‑native bare‑metal Kubernetes platform, and improve the user experience without changing the query interface. Key challenges included mapping incompatibilities, field‑type conflicts, query context loss, duplicate alerts, and resource bottlenecks such as remote‑storage I/O limits, Logstash performance caps, and shard misconfigurations.
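To make the field‑type problem concrete: if the same field is mapped as, say, long in some legacy indices and text in others, a query spanning both fails or sorts incorrectly. One common remedy, sketched below with a hypothetical field name, index pattern, and endpoint (none taken from the article), is to pin the type in a composable index template on the new cluster so every new index agrees:

  # Pin a conflicting field to one type for all new indices
  # (field name, pattern, and host are illustrative)
  curl -XPUT 'http://es-new:9200/_index_template/logs-shared-fields' \
    -H 'Content-Type: application/json' -d '{
      "index_patterns": ["app-logs-*"],
      "priority": 100,
      "template": {
        "mappings": {
          "properties": { "status_code": { "type": "keyword" } }
        }
      }
    }'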

Execution Steps

1. Preserve User Experience During Migration

Use Cross‑Cluster Search (CCS) to allow Kibana to query both old and new clusters simultaneously.

Migrate indices one‑by‑one because the new cluster started with only nine physical nodes.
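A minimal sketch of the CCS wiring, with a hypothetical remote‑cluster alias (legacy_es7) and placeholder hosts: the new cluster registers the old one as a remote, and a single search fans out to both by prefixing the remote index pattern.

  # Register the legacy 7.x cluster as a remote on the new cluster
  curl -XPUT 'http://es-new:9200/_cluster/settings' \
    -H 'Content-Type: application/json' -d '{
      "persistent": {
        "cluster.remote.legacy_es7.seeds": ["es7-node1:9300", "es7-node2:9300"]
      }
    }'

  # One query spans the remote (prefixed) and local index patterns
  curl 'http://es-new:9200/legacy_es7:app-logs-*,app-logs-*/_search' \
    -H 'Content-Type: application/json' -d '{
      "query": { "match": { "message": "error" } }
    }'

Kibana data views built on the combined pattern let existing saved searches keep working while indices move over one by one.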

2. Resolve Data Backlog

Sample high‑volume logs at the producer side to reduce inbound traffic.

Drop logs larger than 10 MB at the producer.

Adjust Kafka partition counts (e.g., 45 partitions for hot topics) and consumer thread pools based on load testing.
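Partition expansion itself is a one‑liner with the stock Kafka tooling (broker address and topic name below are placeholders). Note that Kafka partition counts can only grow, which is why a target like 45 is worth validating with load tests first:

  # Grow a hot topic to 45 partitions; this cannot be undone
  kafka-topics.sh --bootstrap-server kafka:9092 \
    --alter --topic app-hot-logs --partitions 45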

3. Boost Consumer Write Capacity

Replace the Logstash consumers with Vector, which processes more than 2× the data on the same hardware (see the sketch after this step).

Group topics for fine‑grained consumption, isolating backlogs to specific groups.
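A minimal sketch of such a Vector consumer, assuming a Kafka source and an Elasticsearch sink (topic, group, and endpoint names are illustrative); the batch values mirror the configuration samples later in the article:

  [sources.app_logs]
  type              = "kafka"
  bootstrap_servers = "kafka:9092"
  group_id          = "vector-app-logs"  # one consumer group per topic group isolates backlogs
  topics            = ["app-logs"]

  [sinks.es_logs]
  type       = "elasticsearch"
  inputs     = ["app_logs"]
  endpoints  = ["http://es-ingest:9200"]
  bulk.index = "app-logs-%Y.%m.%d"       # daily index per topic group
  batch.max_events   = 2000
  batch.max_bytes    = 20971520
  batch.timeout_secs = 5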

4. Enhance Storage Capability

Enable compression on Kafka (zstd) and Elasticsearch (gzip for 7.x, zstd for 8.x), reducing storage by ~40‑50%.

Optimize shard sizing (20‑40 GB per shard) and implement rollover policies.
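Rollover by primary‑shard size is what keeps indices inside that 20‑40 GB band automatically. A sketch of an ILM policy (policy name, size, and age cutoff are illustrative, not from the article):

  curl -XPUT 'http://es-new:9200/_ilm/policy/logs-default' \
    -H 'Content-Type: application/json' -d '{
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_primary_shard_size": "40gb", "max_age": "1d" }
            }
          }
        }
      }
    }'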

5. Shorten Log‑Ingestion Workflow

Automate the end‑to‑end pipeline: application → topic → Vector → Elasticsearch index/template.

Introduce a ticket‑less onboarding process that creates topics, mappings, and indices automatically, cutting manual setup from ~2 hours to ~5 minutes.
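What that automation boils down to, sketched as a shell sequence with placeholder names (the real pipeline presumably drives these calls from a service catalog rather than a script): create the topic, install an index template so mappings and settings apply to every future index, and bootstrap the first write index.

  # 1. Topic for the application (name is illustrative)
  kafka-topics.sh --bootstrap-server kafka:9092 \
    --create --topic logs-dpfm-orders --partitions 6 --replication-factor 3

  # 2. Index template: new indices inherit settings and mappings automatically
  curl -XPUT 'http://es-ingest:9200/_index_template/logs-dpfm-orders' \
    -H 'Content-Type: application/json' -d '{
      "index_patterns": ["logs-dpfm-orders-*"],
      "template": {
        "settings": { "codec": "best_compression", "refresh_interval": "73s" }
      }
    }'

  # 3. Bootstrap the first write index for rollover
  curl -XPUT 'http://es-ingest:9200/logs-dpfm-orders-000001'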

Configuration Samples

Kafka size limits (consumer fetch, producer request, and broker message, respectively):

  fetch.max.bytes: 30000000
  max.request.size: 31457280
  message.max.bytes: 41943040

Vector batching and compression:

  max_events: 2000
  max_bytes: 20971520
  timeout_secs: 5
  compression: "zstd"

Elasticsearch index settings:

  "refresh_interval": "73s",
  "mode": "logsdb",
  "codec": "best_compression",
  "routing": {"total_shards_per_node": "1"}

Results

After three months of migration and validation:

Query P99 latency < 5 s (80% faster queries).

Cluster throughput doubled from 450 k/s to 900 k/s.

Write‑throughput rose 200% (producer up to 100 k/s, consumer up to 30‑50 k/s).

Storage compression saved ~50% of disk space, cutting hardware costs.

CPU usage dropped ~30% and overall stability improved, with fewer alerts and faster incident recovery.

Future Plans

Further automate log ingestion using CMDB‑driven Filebeat routing and eliminate the remaining Logstash components.

Integrate the platform with APM and monitoring systems for a unified “log‑monitor‑alert” workflow.

Introduce LLM‑powered natural‑language search so business users can query logs with requests like “show today’s dpfm interface errors”.

Key Images

[Figure] Original architecture
[Figure] Upgraded architecture
[Figure] CCS migration flow
[Figure] Storage compression comparison
[Figure] Performance metrics
Tags: performance optimization, data pipeline, log platform