How Starbucks China Revamped Its Log Platform: From VMs to Cloud‑Native Kubernetes with 80% Faster Queries
Starbucks China’s logging team migrated several petabytes of logs from legacy VM‑based Elasticsearch clusters to a cloud‑native bare‑metal Kubernetes platform, upgrading ES from 7.x to 8.x, containerizing components, optimizing storage and Kafka, and achieving up to 80% query speed gains, 30% CPU reduction, and 200% write‑throughput improvement.
Background
Starbucks China’s log platform had suffered from slow queries, high storage costs, and low availability since 2017. By 2024 the system handled multiple petabytes of data across several Elasticsearch clusters (7.8/7.9) deployed on virtual machines, with Logstash producers/consumers, Kafka, and Filebeat handling ingestion.
Goals and Challenges
The team planned a migration from September 2024 to June 2025 to upgrade all components to Elasticsearch 8.x, move to a cloud‑native bare‑metal Kubernetes platform, and improve the user experience without changing the query interface. Key challenges included mapping incompatibilities, field‑type conflicts, query context loss, duplicate alerts, and resource bottlenecks such as remote‑storage I/O limits, Logstash performance caps, and shard misconfigurations.
Execution Steps
1. Preserve User Experience During Migration
Use Cross‑Cluster Search (CCS) so Kibana can query both the old and new clusters simultaneously (see the configuration sketch after this list).
Migrate indices one by one because the new cluster started with only nine physical nodes.
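A minimal CCS sketch along these lines (the cluster alias legacy_es7, seed address, index pattern, and query field are illustrative placeholders, not the team’s actual values): the new 8.x cluster registers the legacy cluster as a remote, and a single search fans out to both.

  PUT _cluster/settings
  {
    "persistent": {
      "cluster.remote.legacy_es7.seeds": ["10.0.0.11:9300"]
    }
  }

  GET app-logs-*,legacy_es7:app-logs-*/_search
  {
    "query": { "match": { "level": "ERROR" } }
  }

Kibana index patterns can include the remote prefix, so existing dashboards keep working while indices move over one by one.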
2. Resolve Data Backlog
Sample high‑volume logs at the producer side to reduce inbound traffic.
Drop logs larger than 10 MB at the producer.
Adjust Kafka partition counts (e.g., 45 partitions for hot topics) and consumer thread pools based on load testing.
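As a rough illustration (topic name and broker address are assumed), growing a hot topic to 45 partitions is an in‑place alter, since Kafka only allows partition counts to increase:

  kafka-topics.sh --bootstrap-server kafka.internal:9092 \
    --alter --topic app-logs-hot --partitions 45

Consumer parallelism is capped at one consumer thread per partition, which is why partition counts and consumer thread pools were tuned together under load testing.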
3. Boost Consumer Write Capacity
Replace the Logstash consumer with Vector, which processes more than 2× the data on the same hardware.
Group topics for fine‑grained consumption, isolating backlogs to specific groups.
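A minimal Vector configuration sketch of this pattern (topic names, endpoints, and the consumer‑group id are assumptions): one Kafka source per topic group feeding its own Elasticsearch sink, so a backlog in one group cannot stall the others.

  # vector.toml
  [sources.group_payment]
  type              = "kafka"
  bootstrap_servers = "kafka.internal:9092"
  group_id          = "vector-payment"
  topics            = ["logs-payment", "logs-refund"]

  [sinks.es_payment]
  type             = "elasticsearch"
  inputs           = ["group_payment"]
  endpoints        = ["https://es8.internal:9200"]
  bulk.index       = "logs-payment-%Y.%m.%d"
  batch.max_events = 2000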
4. Enhance Storage Capability
Enable compression on Kafka (zstd) and Elasticsearch (gzip for 7.x, zstd for 8.x), reducing storage by ~40‑50%.
Optimize shard sizing (20‑40 GB per shard) and implement rollover policies.
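A sketch of what such a rollover policy can look like with an ILM policy (the policy name and the 1‑day age trigger are illustrative; the shard‑size cap matches the 20‑40 GB target above):

  PUT _ilm/policy/logs-rollover
  {
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {
              "max_primary_shard_size": "40gb",
              "max_age": "1d"
            }
          }
        }
      }
    }
  }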
5. Shorten Log‑Ingestion Workflow
Automate the end‑to‑end pipeline: application → topic → Vector → Elasticsearch index/template.
Introduce a ticket‑less onboarding process that creates topics, mappings, and indices automatically, cutting manual setup from ~2 hours to ~5 minutes.
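A simplified sketch of what such an onboarding step automates (service name, partition count, shard count, and endpoints are placeholders): create the Kafka topic, then push a matching index template so the first bulk write lands on a properly configured index.

  kafka-topics.sh --bootstrap-server kafka.internal:9092 --create \
    --topic logs-demo-svc --partitions 12 --replication-factor 3

  curl -X PUT "https://es8.internal:9200/_index_template/logs-demo-svc" \
    -H "Content-Type: application/json" -d '{
      "index_patterns": ["logs-demo-svc-*"],
      "template": {
        "settings": { "index.codec": "best_compression", "number_of_shards": 3 }
      }
    }'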
Configuration Samples
Kafka message-size settings:
  fetch.max.bytes: 30000000
  max.request.size: 31457280
  message.max.bytes: 41943040

Vector sink settings (batching and compression):
  max_events: 2000
  max_bytes: 20971520
  timeout_secs: 5
  compression: "zstd"

Elasticsearch index settings:
  "refresh_interval": "73s",
  "mode": "logsdb",
  "codec": "best_compression",
  "routing": {"total_shards_per_node": "1"}

Results
After three months of migration and validation:
Query P99 latency < 5 s (80% faster queries).
Cluster throughput increased from 450 k/s to 900 k/s, roughly doubling.
Write‑throughput rose 200% (producer up to 100 k/s, consumer up to 30‑50 k/s).
Storage compression saved ~50% of disk space, cutting hardware costs.
CPU usage dropped ~30% and overall stability improved, with fewer alerts and faster incident recovery.
Future Plans
Further automate log ingestion using CMDB‑driven Filebeat routing and eliminate the remaining Logstash components.
Integrate the platform with APM and monitoring systems for a unified “log‑monitor‑alert” workflow.
Introduce LLM‑powered natural‑language search to allow business users to query logs like “show today’s dpfm interface errors”.