How to Build a Scalable, Low‑Cost Log Platform for Massive Data Volumes
This article details the design and implementation of a unified log platform that handles peak write rates of up to one million events per second, balances performance, stability, and cost, and leverages Filebeat, Kafka, Flink, and Elasticsearch across multi‑cloud environments.
Background
Miss Fresh's early log and monitoring systems were fragmented, with only a few business lines having dedicated solutions (mlist for the transaction platform, EFK for the search team). Most services accessed logs by logging directly into servers, which became unsustainable as the business rapidly expanded, prompting the need for a unified, large‑scale log platform.
Architecture Design
Core Challenges
Performance : During major sales events, write peaks reach ~600k events/s; the target is to sustain one million writes per second and to answer 99% of queries over the three‑day hot window (~30 billion records) in under a second.
Stability : The system must remain stable under bursty writes, ensure data integrity when components fail, and prevent Elasticsearch from crashing under massive queries.
Cost : Elasticsearch is fast but expensive; storing petabytes of log data for long periods demands cost‑effective storage solutions.
Operations Management : Over 400 applications and 3,000+ servers require automated maintenance of components such as ES, collectors, and deployment pipelines.
Cross‑cloud Deployment : Services run on UC and TC clouds; the architecture must minimize dedicated‑line bandwidth while supporting cross‑cloud queries.
Given log characteristics (append‑only, time‑ordered, write‑heavy, read‑light), the design prioritizes fast queries and low cost, while sacrificing strong consistency, exact precision, and ultra‑high QPS queries.
Technical Choices
Collection Layer : Filebeat (written in Go) was chosen over Logstash and Flume for its lightweight footprint.
Buffer Layer : Kafka serves as the primary buffer to absorb traffic spikes; RocketMQ is considered for more sensitive workloads.
Storage Layer : Elasticsearch is retained as the main store due to existing expertise and ecosystem support.
Processing Layer : Flink was selected over Storm for its higher throughput (≈5×), stream‑native architecture, and modern ecosystem.
Visualization : A custom log‑query UI replaces Kibana to address its complex UI, limited functionality, and lack of query optimization.
Architecture Overview
Filebeat collects logs and pushes them to Kafka; Flink processes Kafka streams in real time and writes results to Elasticsearch. The query platform reads directly from ES, while older data is periodically offloaded to object storage.
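The write path above boils down to batching: the stream job accumulates events from Kafka and flushes them to Elasticsearch in bulk rather than one document at a time. A minimal Python sketch of that batching logic (the class name, batch size, and sink callback are illustrative, not the platform's actual code):

```python
class BulkBuffer:
    """Buffers events and flushes them to a sink in fixed-size batches,
    mimicking how a stream job batches bulk writes to Elasticsearch."""

    def __init__(self, sink, batch_size=500):
        self.sink = sink              # callable that receives a list of events
        self.batch_size = batch_size
        self._buf = []

    def add(self, event):
        self._buf.append(event)
        if len(self._buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._buf:
            self.sink(self._buf)
            self._buf = []

batches = []
buf = BulkBuffer(sink=batches.append, batch_size=3)
for i in range(7):
    buf.add({"app": "order-service", "msg": f"event {i}"})
buf.flush()  # drain the remainder: batches of 3, 3, 1 events
```

In the real pipeline Flink adds a time-based flush as well, so a slow trickle of events is not held in the buffer indefinitely.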
Log ingestion is delegated to business teams via an automated ticket system; deployment and scaling are managed by Ansible through the MAX release platform. Metadata (log paths, retention policies) is stored in a management service, and index lifecycle is automated.
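Automated index lifecycle management is mostly date arithmetic: given each application's retention policy from the metadata service, compute which daily indices have aged out. A hedged sketch, assuming an illustrative `log-{app}-YYYY.MM.DD` naming scheme (not necessarily the platform's real convention):

```python
from datetime import date, timedelta

def expired_indices(retention_days, today, known_indices):
    """Return the daily indices older than the retention window.
    Index naming (log-{app}-YYYY.MM.DD) is illustrative only."""
    cutoff = today - timedelta(days=retention_days)
    out = []
    for name in known_indices:
        day_part = name.rsplit("-", 1)[1]          # "2023.05.01"
        day = date(*map(int, day_part.split(".")))
        if day < cutoff:
            out.append(name)
    return out

indices = [f"log-order-2023.05.{d:02d}" for d in range(1, 11)]
# With 7-day retention on 2023-05-10, May 1 and May 2 are expired.
expired = expired_indices(7, date(2023, 5, 10), indices)
```

A scheduled job can then delete or archive the returned indices without human intervention.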
Key Implementations
Million‑writes‑per‑second : Use cloud SSDs (10× performance of HDD), create separate indices per application, let ES auto‑generate document IDs, and employ asynchronous disk writes with optimistic failure handling.
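Two of these write optimizations are visible in the shape of the bulk request itself: each application gets its own index, and the action line carries no `_id`, so Elasticsearch auto-generates document IDs and skips the duplicate-ID lookup on every insert. A sketch of building the `_bulk` NDJSON body (index naming is illustrative):

```python
import json

def bulk_payload(app, day, events):
    """Build an Elasticsearch _bulk body targeting a per-application
    daily index. Omitting `_id` lets ES auto-generate IDs, which
    indexes faster than client-supplied IDs."""
    index = f"log-{app}-{day}"
    lines = []
    for ev in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # no _id
        lines.append(json.dumps(ev))
    return "\n".join(lines) + "\n"   # _bulk bodies end with a newline

body = bulk_payload("order-service", "2023.05.10",
                    [{"level": "INFO", "msg": "payment ok"}])
```

Per-application indices also keep mappings small and let hot applications be sharded independently of quiet ones.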
Hundred‑billion‑record sub‑second queries : Reduce the cardinality of filtered fields, eliminate unnecessary query clauses, and use link‑trace visualizations to expose where query time is spent.
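One concrete way to cut query cost on Elasticsearch is to keep every predicate in filter context: filters skip relevance scoring and their results are cacheable, which pays off when the same app/time-range conditions repeat across queries. A sketch with illustrative field names (not the platform's actual schema):

```python
def build_query(app, start_ms, end_ms, level=None):
    """Assemble a bool query with every condition in filter context:
    no scoring, cacheable clauses. Field names are illustrative."""
    filters = [
        {"term": {"app": app}},
        {"range": {"@timestamp": {"gte": start_ms, "lte": end_ms}}},
    ]
    if level:
        filters.append({"term": {"level": level}})
    return {"query": {"bool": {"filter": filters}}}

q = build_query("order-service", 1683676800000, 1683763200000, level="ERROR")
```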
Data latency control : Critical services achieve near‑real‑time ES ingestion; less critical workloads tolerate minute‑level delays.
Cost control : On‑demand log ingestion, retain only seven days of hot data in ES, compress older logs (≈30% size reduction), and move stale data to cheap object storage.
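The tiering rule above can be sketched as a function of log age, and the compression win is easy to demonstrate because log text is highly repetitive. The 30-day warm boundary below is an assumed illustration (the article only specifies the seven-day hot window):

```python
import gzip

def storage_tier(age_days):
    """Pick a storage tier by log age: hot ES for seven days, compressed
    warm storage after that, cheap object storage for stale data.
    The 30-day warm boundary is an assumed illustration."""
    if age_days <= 7:
        return "es-hot"
    if age_days <= 30:
        return "warm-compressed"
    return "object-storage"

# Repetitive log text compresses very well with gzip.
raw = b"2023-05-10 12:00:01 INFO order-service payment ok\n" * 1000
packed = gzip.compress(raw)
ratio = len(packed) / len(raw)   # far below 1.0 for repetitive logs
```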
Automation : Configuration is centralized, scaling is automated via Ansible, and all components are integrated into a unified monitoring and alerting system.
Cross‑cloud strategy : Write traffic stays within the local cloud; query traffic traverses dedicated lines only when necessary, with UC as the primary cluster and TC as a secondary replica.
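The routing rule can be sketched as a planning step: given where each application's logs live, only the applications hosted in the other cloud force traffic over the dedicated line. Cloud names follow the article (UC/TC); the function and field names are illustrative:

```python
def route_query(caller_cloud, target_apps, app_locations):
    """Split a query plan into local lookups and cross-line lookups.
    Queries stay in the caller's cloud whenever the data is local."""
    local, remote = [], []
    for app in target_apps:
        bucket = local if app_locations[app] == caller_cloud else remote
        bucket.append(app)
    return {"local": local, "cross_line": remote}

locations = {"order": "UC", "search": "TC"}
plan = route_query("UC", ["order", "search"], locations)
# only "search" crosses the dedicated line
```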
Stability measures : Pause log collection during peak writes, limit query time ranges and hierarchy depth, and partition large index sets into logical clusters to balance load.
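Query guardrails are cheapest to enforce before the request ever reaches Elasticsearch. A minimal validation sketch; the concrete limits are illustrative, not the platform's real thresholds:

```python
MAX_RANGE_HOURS = 24 * 3   # keep queries inside the hot window (illustrative)
MAX_AGG_DEPTH = 3          # cap aggregation nesting (illustrative)

def validate_query(range_hours, agg_depth):
    """Reject queries that could destabilize the cluster, returning a
    list of violations (empty means the query is allowed through)."""
    errors = []
    if range_hours > MAX_RANGE_HOURS:
        errors.append("time range exceeds the hot-data window")
    if agg_depth > MAX_AGG_DEPTH:
        errors.append("aggregation nesting too deep")
    return errors

ok = validate_query(range_hours=48, agg_depth=2)    # no violations
bad = validate_query(range_hours=240, agg_depth=5)  # two violations
```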
Platform Features
Full‑link logs : Correlate requestId/traceId across services for end‑to‑end visibility.
Context logs : View all logs for a single request in chronological order.
Surrounding logs : Inspect logs before and after a target entry to aid root‑cause analysis.
Real‑time logs : Stream logs with minimal latency.
Log download : Export logs for offline analysis.
Log subscription : Feed large volumes of real production logs to automated test platforms.
Usable UI : No DSL knowledge required; parameters are form‑based, queries can be saved, results copied with one click, filtered by IP, and scoped with quick time‑range shortcuts.
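Of the features above, "surrounding logs" is essentially a window extraction: jump from a search hit to its neighbors in the same file to see what led up to it. A self-contained sketch over an already-sorted list of lines (real implementations page through ES or the raw file instead):

```python
def surrounding(lines, target_index, before=2, after=2):
    """Return the log lines around a target entry, clamped to the
    bounds of the list."""
    lo = max(0, target_index - before)
    hi = min(len(lines), target_index + after + 1)
    return lines[lo:hi]

logs = [f"line {i}" for i in range(10)]
window = surrounding(logs, 5)            # lines 3 through 7
edge = surrounding(logs, 0, after=1)     # clamped at the start
```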
Current Status
The platform is in production; the original article includes screenshots of the query UI and the architecture diagrams described above.
Future Plans
Containerization : Ensure log continuity across container restarts, embed container metadata in logs, and maintain collector stability and isolation.
Full‑log platform : Expand data source coverage and apply advanced log mining to extract additional business value.
Miss Fresh Tech Team