How to Build a Scalable, Reliable K8s Log Platform for Enterprise Needs

This article explains how to design and implement a flexible, high‑performance log system for Kubernetes environments, covering demand‑driven architecture, functional requirements, open‑source component choices, the reasons for a custom solution, and the operational challenges faced at massive scale.

Alibaba Cloud Native

Nov 19, 2019

Demand‑Driven Architecture Design

Technical architecture translates product requirements into concrete implementations. A thorough requirement analysis prevents building systems that do not meet real needs.

Requirement Breakdown and Feature Design

The following functional requirements were gathered from the internal roles that work with logs:

Collect logs of diverse formats and data sources, including non‑Kubernetes environments.

Fast search and pinpointing of problem logs.

Format semi‑structured or unstructured logs for rapid statistical analysis and visualization.

Real‑time computation of business metrics and alerting (APM‑style).

Multi‑dimensional correlation analysis on massive log volumes with acceptable latency.

Easy integration with external systems or custom data sources (e.g., third‑party audit services).

Intelligent alerting, prediction, and root‑cause analysis using log and time‑series data, with customizable offline training.
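As an illustration of the formatting requirement above, the following minimal sketch turns unstructured text lines into structured records and computes a simple statistic over them. The line layout (timestamp, level, service, message) and the names `LINE_PATTERN`, `parse_line`, and `count_by_level` are assumptions made for this example; they are not taken from the platform described in this article.

```python
import re
from collections import Counter

# Hypothetical layout: "<timestamp> <LEVEL> <service> <free-text message>"
LINE_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w.-]+)\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def count_by_level(lines):
    """Count parsed records per log level, skipping unparseable lines."""
    records = (parse_line(line) for line in lines)
    return Counter(r["level"] for r in records if r is not None)


if __name__ == "__main__":
    sample = [
        "2019-11-19T10:00:00Z INFO checkout order created id=42",
        "2019-11-19T10:00:01Z ERROR checkout payment timeout id=43",
        "not a structured line",
    ]
    print(count_by_level(sample))
```

Once lines are parsed into named fields, grouping and counting become ordinary dictionary operations, which is what makes rapid statistics and visualization practical.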

Core Functional Modules

Comprehensive log collection (DaemonSet, Sidecar, agents) for containers, web, mobile, IoT, and physical/virtual machines.

Real‑time log pipelines that expose logs to upstream and downstream systems.

ETL‑style data cleansing: filtering, enrichment, transformation, splitting, aggregation.

Log presentation and keyword‑based search with context view.

Real‑time analytics for root‑cause investigation and business‑metric calculation.

Stream processing (Flink, Storm, Spark Streaming) for live metric computation and custom cleaning.

Offline analysis for historical, multi‑dimensional correlation (T+1 latency).

Machine‑learning integration to feed historical logs into training pipelines and serve online inference.
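The cleansing steps named in the module list (filtering, enrichment, transformation, splitting, aggregation) can be sketched in plain Python as follows. The record fields (`level`, `host`, `latency_ms`, `items`) and the host-to-region lookup table are hypothetical; a production system would run equivalent logic inside its own cleansing or stream-processing engine.

```python
from collections import defaultdict


def cleanse(records, region_by_host):
    """Apply filter, enrich, transform, and split steps to raw log records."""
    for record in records:
        # Filter: drop debug noise.
        if record.get("level") == "DEBUG":
            continue
        # Enrich: attach a region looked up from a (hypothetical) host table.
        region = region_by_host.get(record.get("host"), "unknown")
        enriched = dict(record, region=region)
        # Transform: normalize latency from a string to an integer.
        enriched["latency_ms"] = int(enriched.get("latency_ms", 0))
        # Split: emit one record per comma-separated entry in "items".
        for item in str(enriched.pop("items", "")).split(","):
            yield dict(enriched, item=item.strip())


def aggregate_latency(records):
    """Aggregate: average latency per region."""
    totals = defaultdict(lambda: [0, 0])  # region -> [sum, count]
    for record in records:
        totals[record["region"]][0] += record["latency_ms"]
        totals[record["region"]][1] += 1
    return {region: total / count for region, (total, count) in totals.items()}


if __name__ == "__main__":
    raw = [
        {"level": "DEBUG", "host": "h1", "latency_ms": "5", "items": "a"},
        {"level": "INFO", "host": "h1", "latency_ms": "120", "items": "a, b"},
        {"level": "ERROR", "host": "h9", "latency_ms": "80", "items": "c"},
    ]
    cleaned = list(cleanse(raw, {"h1": "cn-hangzhou"}))
    print(aggregate_latency(cleaned))
```

Writing the per-record steps as a generator keeps them composable: the aggregation consumes the cleansed stream without the intermediate result having to fit a particular storage format.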

Open‑Source Solution Design

A typical ELK‑centric stack can satisfy the above requirements:

Log agents (Filebeat, Fluentd) collect container logs uniformly.

Kafka provides a high‑throughput buffering layer.

Logstash or Flink consumes the Kafka streams, performs cleansing, and writes the cleaned data back to Kafka.

Cleaned data is indexed in Elasticsearch for real‑time search, processed by Flink for streaming analytics, stored in Hadoop for batch analysis, and fed to TensorFlow for offline model training.

Visualization is handled by Grafana or Kibana.
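The streaming-analytics stage of such a stack typically groups events into time windows and evaluates a metric per window. The sketch below shows that idea in plain Python rather than with the Flink API; the event shape `(epoch_seconds, level)`, the 60-second window, and the 0.5 alert threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def error_rate_by_window(events, window_seconds=60):
    """Group (epoch_seconds, level) events into tumbling windows.

    Returns {window_start: error_rate}, ordered by window start.
    """
    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for ts, level in events:
        window_start = ts - ts % window_seconds  # align to window boundary
        counts[window_start]["total"] += 1
        if level == "ERROR":
            counts[window_start]["errors"] += 1
    return {
        start: c["errors"] / c["total"] for start, c in sorted(counts.items())
    }


def windows_to_alert(rates, threshold=0.5):
    """Return the window starts whose error rate reaches the threshold."""
    return [start for start, rate in rates.items() if rate >= threshold]


if __name__ == "__main__":
    events = [
        (0, "INFO"), (10, "ERROR"), (59, "INFO"),
        (60, "ERROR"), (61, "ERROR"), (119, "INFO"),
    ]
    rates = error_rate_by_window(events)
    print(rates, windows_to_alert(rates))
```

A real stream processor adds what this sketch omits: out-of-order event handling, state checkpointing, and emitting each window's result as soon as the window closes.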

Why a Self‑Developed Solution?

While open‑source components enable rapid prototyping, large‑scale production introduces challenges:

Scaling Kafka, Elasticsearch, and agents (DaemonSet vs. Sidecar) under petabyte‑scale traffic.

Ensuring low latency and high availability (SLAs of 99.9% to 99.99%) for core business services.

Mitigating noisy‑query impact on shared clusters.

Controlling operational costs when daily ingest reaches multiple petabytes.
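To make the availability targets above concrete, the following small calculation converts an availability percentage into a downtime budget. The function name and the 365-day year are assumptions for this example.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Return the downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * 24 * 60


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

Over a 365-day year, 99.9% leaves a budget of about 525.6 minutes (roughly 8.8 hours), while 99.99% leaves about 52.6 minutes, so each extra nine shrinks the tolerance for log-platform outages by a factor of ten.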

Alibaba K8s Log Solution

A custom Logtail agent provides full‑stack K8s data collection; it has been deployed at a scale of millions of instances across Alibaba Group and has passed financial‑grade stress tests during the Double 11 (Singles' Day) shopping festival.

A unified pipeline integrates queueing, cleansing, real‑time search, analytics, and AI algorithms, reducing data‑path length and failure points.

All components are optimized for log workloads: queries over billions of log entries return within seconds, and the system offers dynamic scaling, low cost, and high availability.

Seamless integration with downstream open‑source and Alibaba Cloud products supports dozens of use cases (streaming, batch, visualization, alerting).

The platform serves Alibaba Group, Ant Group, and thousands of cloud customers, ingesting >16 PB of logs per day.
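For a sense of what that daily volume means as a sustained rate, the figure can be converted to an average throughput. This calculation assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and perfectly even traffic; real traffic peaks well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_rate_gb_per_s(petabytes_per_day):
    """Convert a daily ingest volume in PB to an average rate in GB/s."""
    bytes_per_day = petabytes_per_day * 10**15
    return bytes_per_day / SECONDS_PER_DAY / 10**9


if __name__ == "__main__":
    print(f"{average_ingest_rate_gb_per_s(16):.1f} GB/s on average")
```

Under those assumptions, 16 PB per day corresponds to an average of roughly 185 GB/s.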

Remaining Operational Questions

Key open issues that still require engineering effort include:

Choosing the optimal log‑collection strategy on K8s (DaemonSet vs. Sidecar).

Integrating the log solution with CI/CD pipelines.

Partitioning log storage for micro‑service isolation.

Deriving K8s health metrics directly from logs for cluster monitoring.

Implementing reliability monitoring for the log platform itself.

Automating inspection across multiple micro‑services/components.

Rapid anomaly detection and traffic‑spike localization across sites.
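As one possible starting point for the last item, the sketch below flags traffic spikes in per-minute log counts with a trailing-window z-score and reports which sites are affected. This is a generic statistical technique, not the algorithm used by the platform described here; the window size, the threshold, and the floor on the standard deviation are assumptions.

```python
from statistics import mean, pstdev


def find_spikes(counts, window=5, z_threshold=3.0, min_sigma=1.0):
    """Return indexes whose value exceeds the trailing-window mean by more
    than z_threshold standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor sigma so a perfectly flat baseline does not divide by zero.
        sigma = max(pstdev(history), min_sigma)
        if (counts[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes


def localize_spikes(counts_by_site, **kwargs):
    """Return {site: spike_indexes} for the sites that show any spike."""
    result = {}
    for site, counts in counts_by_site.items():
        spikes = find_spikes(counts, **kwargs)
        if spikes:
            result[site] = spikes
    return result


if __name__ == "__main__":
    traffic = {
        "site-a": [100, 102, 98, 101, 99, 100, 400, 101],
        "site-b": [50, 50, 50, 50, 50, 50, 50, 50],
    }
    print(localize_spikes(traffic))
```

One limitation worth noting: a detected spike enters the trailing window of the following points and inflates their baseline, so a production detector would usually exclude flagged points or use a more robust statistic such as the median.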

Typical Pipeline Example

# Log collection (Logtail or Filebeat) -> Kafka topic "raw-logs"
# Stream cleansing (Flink) reads "raw-logs", applies ETL, writes to "clean-logs"
# Elasticsearch indexes "clean-logs" for keyword search
# Flink computes real‑time metrics from "clean-logs" and pushes alerts
# Hadoop consumes "clean-logs" for offline T+1 analysis
# TensorFlow reads historical logs for model training, model served back to Flink

This concrete pipeline illustrates how data flows from collection to storage, real‑time analytics, batch processing, and machine‑learning integration.
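The first three stages of this flow (collection, cleansing, keyword search with context) can be imitated end to end in memory. In this sketch a plain dictionary of lists stands in for the Kafka topics and a token-to-line-number map stands in for the Elasticsearch index; none of the function names correspond to real Kafka, Flink, or Elasticsearch APIs.

```python
from collections import defaultdict


def collect(topics, lines):
    """Collection: append raw lines to the 'raw-logs' stand-in topic."""
    topics["raw-logs"].extend(lines)


def cleanse(topics):
    """Cleansing: consume 'raw-logs', drop blank lines, trim whitespace,
    and write the result to 'clean-logs'."""
    while topics["raw-logs"]:
        text = topics["raw-logs"].pop(0).strip()
        if text:
            topics["clean-logs"].append(text)


def build_index(docs):
    """Indexing: map each lowercase token to the line numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(docs, index, keyword, context=1):
    """Search: return each matching line with `context` neighbors on each side."""
    hits = sorted(index.get(keyword.lower(), ()))
    return [docs[max(0, i - context): i + context + 1] for i in hits]


if __name__ == "__main__":
    topics = {"raw-logs": [], "clean-logs": []}
    collect(topics, ["  start job  ", "", "payment timeout", "job done"])
    cleanse(topics)
    docs = topics["clean-logs"]
    print(search(docs, build_index(docs), "timeout"))
```

Returning neighboring lines along with each hit mirrors the context view mentioned earlier: an error line is rarely useful without the lines that surround it.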
