Which Log Collection System Wins? Scribe, Chukwa, Kafka, Flume & ELK Compared

This article reviews the background, requirements, and architectural designs of major open‑source log collection systems—including Facebook’s Scribe, Apache’s Chukwa, LinkedIn’s Kafka, Cloudera’s Flume—and evaluates mature monitoring tools such as ELK, highlighting their features, use cases, advantages, and drawbacks for large‑scale log processing.

21CTO

Oct 30, 2020

1. Background Introduction

Many platforms generate massive volumes of logs every day, typically as streaming data such as page views and search queries. Processing them calls for a dedicated log system that (1) decouples the application from the analysis systems; (2) supports both near‑real‑time online analysis and offline batch analysis (e.g., on Hadoop); and (3) scales horizontally by adding nodes.

2. Log System Comparison

How to Collect System Logs and Analyze Them

A. Real‑time Mode

Deploy an agent on the log‑producing server.

The agent uploads log increments to a compute cluster in a low‑overhead way, so that collection does not burden the log‑producing server.

The compute cluster parses the logs and computes results in a distributed, load‑balanced manner, optionally using a multi‑layer architecture for aggregation.

Write results to the most suitable storage (e.g., time‑series store for periodic analysis).

Build a query/reporting system on top of the storage.

Common compute technology: Storm.
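The agent's "upload log increments" step can be sketched as follows. This is a minimal illustration rather than any particular tool's implementation: the function name `read_increment` and the byte-offset bookkeeping are assumptions made for the example, and log rotation or truncation is not handled.

```python
def read_increment(path, offset):
    """Return (complete_lines, new_offset) for data appended since `offset`.

    `offset` is the byte position up to which the file was already shipped.
    """
    with open(path, "rb") as log_file:
        log_file.seek(offset)
        chunk = log_file.read()

    # Ship only whole lines; a trailing partial line waits for the next poll.
    end = chunk.rfind(b"\n") + 1  # 0 when the chunk has no newline at all
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

An agent would call this on a timer, send the returned lines to the cluster, and persist the new offset only after the send succeeds, so that a crash causes a resend rather than a gap.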

B. Near‑real‑time Mode

Deploy an agent on the log‑producing server.

The agent uploads log increments to a buffering cluster.

The buffering cluster writes raw logs to HDFS‑type storage.

Use Hadoop jobs to parse logs and compute results.

Write results to HBase.

Use Hadoop‑derived modeling and query tools to generate reports.

Supplement: Hive can simplify processing.
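The "parse logs and compute results" step of the batch path follows the map/reduce pattern. The sketch below runs that pattern in a single Python process purely for illustration; a real Hadoop job distributes both phases across nodes. The log format (`timestamp user page`) and all function names are assumptions made for this example.

```python
from collections import Counter
from itertools import chain


def map_line(line):
    """Map phase: emit (page, 1) for each well-formed 'timestamp user page' line."""
    parts = line.split()
    if len(parts) == 3:  # silently skip malformed lines
        yield parts[2], 1


def reduce_counts(pairs):
    """Reduce phase: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines):
    """Run map then reduce over raw log lines and return page-view counts."""
    return reduce_counts(chain.from_iterable(map_line(line) for line in lines))
```

Hive expresses the same aggregation as a single `GROUP BY` query, which is the simplification the supplement refers to.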

2.1 Common Open‑Source Log Systems Comparison

A. Facebook’s Scribe

Scribe is an open‑source log collection system used extensively within Facebook. It gathers logs from various sources and stores them in a central storage system (e.g., NFS, distributed file systems) for centralized statistical analysis. It offers a scalable, highly fault‑tolerant solution for distributed log collection and unified processing.

Features: strong fault tolerance—if the backend storage crashes, Scribe writes data locally and reloads it once the storage recovers.
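The store-on-failure behavior described above can be sketched like this. It is a simplified model, not Scribe's actual code: the class name `BufferedStore` is hypothetical, an in-memory queue stands in for the local disk file, and the backlog is replayed only on the next write rather than by a periodic retry.

```python
from collections import deque


class BufferedStore:
    """Write to a primary store; buffer locally while the primary is down."""

    def __init__(self, primary_write):
        # primary_write: callable that raises OSError when the store is unavailable
        self._primary_write = primary_write
        self._backlog = deque()  # stand-in for a local disk buffer

    def write(self, message):
        try:
            # Replay older messages first so ordering is preserved.
            self._replay_backlog()
            self._primary_write(message)
        except OSError:
            self._backlog.append(message)

    def _replay_backlog(self):
        while self._backlog:
            self._primary_write(self._backlog[0])
            self._backlog.popleft()  # drop only after a successful write

    def pending(self):
        return len(self._backlog)
```

Removing a message from the backlog only after the primary write succeeds means a failure in the middle of a replay loses nothing; at worst a message is retried.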

B. Apache’s Chukwa

Chukwa is an Apache open‑source project built on the Hadoop ecosystem: it uses HDFS for storage and MapReduce for processing, and it provides modules that support log analysis on Hadoop clusters.

Design requirements: (1) flexible, dynamically controllable data sources; (2) high‑performance, highly scalable storage; (3) a suitable framework for analyzing the collected data at large scale.

C. LinkedIn’s Kafka

Kafka, open‑sourced in December 2010, is written in Scala and employs several performance optimizations. Its design goals include persistence with O(1) disk structures (access cost stays constant as the volume of stored messages grows), high throughput (tens of thousands of messages per second on commodity servers), a distributed, partitioned architecture, and parallel loading into Hadoop.

Kafka is a publish‑subscribe messaging system. Producers publish messages to topics; consumers subscribe to topics. Topics are divided into partitions for load balancing. ZooKeeper is used for broker discovery and load balancing.
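To make the topic and partition model concrete, here is a small in-memory sketch. It does not use Kafka's client API; the `Topic` class and its methods are hypothetical, and CRC32 is used only as a convenient deterministic hash (it is not claimed to be Kafka's partitioner).

```python
import zlib


class Topic:
    """In-memory model of a topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append value to the partition chosen by hashing the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index

    def read(self, partition, offset):
        """Return all messages in a partition from `offset` onward."""
        return self.partitions[partition][offset:]
```

Because the same key always maps to the same partition, messages for one key stay in order, while different partitions can be consumed in parallel. Each consumer only has to remember the offset it has reached in each partition.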

D. Cloudera’s Flume

Flume, open‑sourced by Cloudera in July 2009, provides a comprehensive set of components that require little custom development.

Design goals:

Reliability: three levels of guarantee, namely end‑to‑end (the agent writes each event to disk and deletes it only after successful transmission), store‑on‑failure (similar to Scribe: data is written locally when the receiver is down and resent once it recovers), and best‑effort (events are sent without acknowledgment).

Scalability: a three‑tier architecture (agent, collector, storage) in which each tier scales horizontally; a master manages all agents and collectors and relies on ZooKeeper for high availability.

Manageability: the centralized master allows monitoring and dynamic configuration through a web UI or shell scripts.

Extensibility: users can add custom agents, collectors, or storage back‑ends; many built‑in components are available (file, syslog, HDFS, HBase, etc.).

Flume’s layered architecture consists of agents, collectors, and storage. Agents send data to collectors; collectors aggregate data and write to storage (file, HDFS, Hive, HBase, etc.). Example: an agent listening on TCP port 5140 forwards data to a collector, which loads it into HDFS.

E. Summary

Typical log systems consist of three core components: an agent (encapsulates data sources and forwards data), a collector (aggregates data from multiple agents and writes to a central store), and a store (centralized storage with scalability and reliability, often HDFS).
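The three roles in this summary can be sketched as plain Python classes. This is an illustrative model only: the class names mirror the roles in the text, while the batching policy and the in-memory list standing in for HDFS are assumptions.

```python
class Store:
    """Central store; a list stands in for HDFS or a similar back-end."""

    def __init__(self):
        self.records = []

    def append_batch(self, batch):
        self.records.extend(batch)


class Collector:
    """Aggregates events from many agents and writes them in batches."""

    def __init__(self, store, batch_size):
        self._store = store
        self._batch_size = batch_size
        self._pending = []

    def receive(self, source, line):
        self._pending.append((source, line))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._pending:
            self._store.append_batch(self._pending)
            self._pending = []


class Agent:
    """Wraps one data source and forwards its lines to a collector."""

    def __init__(self, host, collector):
        self._host = host
        self._collector = collector

    def emit(self, line):
        self._collector.receive(self._host, line)
```

Batching at the collector is what lets many small agent writes become a few large writes to the store, which suits file systems such as HDFS that favor large sequential writes.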

3. Mature Log Monitoring and Analysis Tools

1. ELK

A. ELK Overview

ELK (Elasticsearch, Logstash, Kibana) is widely used in server operations for log monitoring and analysis. Front‑end logs are highly customizable, unlike back‑end logs with fixed formats. ELK provides a complete pipeline: Elasticsearch for search, Logstash for collection/parsing, and Kibana for visualization.

Elasticsearch: distributed, RESTful search engine built on Lucene, suitable for real‑time search and analytics.

Logstash: log/event management tool for collection, transformation, and forwarding.

Kibana: front‑end visualization framework offering detailed charts and dashboards.
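The hand-off between the parsing stage and Elasticsearch can be sketched as follows. The regular expression plays the role that a Grok pattern plays in Logstash, and the output follows the newline-delimited JSON shape of Elasticsearch's bulk API (an action line followed by the document). The log format, the field names, and the index name used in the usage note are assumptions made for this example, and no network call is made.

```python
import json
import re

# Assumed line format: "<timestamp> <LEVEL> <free-text message>"
LINE_PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)")


def parse_line(line):
    """Turn one raw log line into a structured dict, or None if it does not match."""
    match = LINE_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


def to_bulk_payload(lines, index_name):
    """Build a bulk-API body: one action line plus one document line per event."""
    out = []
    for line in lines:
        doc = parse_line(line)
        if doc is None:
            continue  # skip lines that do not match the assumed format
        out.append(json.dumps({"index": {"_index": index_name}}))
        out.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(out) + "\n" if out else ""
```

Calling `to_bulk_payload(lines, "app-logs")` yields a string that could be sent as the body of a bulk request; once the documents are indexed, Kibana can chart fields such as `level` directly.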

B. ELK Use Cases

Many companies (e.g., Sina, Ele.me, Ctrip) adopt this architecture. It can be applied to front‑end log analysis as well, where it supports the following:

Business data analysis

Error log analysis (similar to Bugly)

Data alerting before large‑scale failures
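The alerting use case usually comes down to a threshold over a sliding time window. The sketch below shows that idea in isolation; it is not an ELK feature or API, and the class name, window length, and threshold are all assumptions.

```python
from collections import deque


class ErrorRateAlert:
    """Signal when at least `threshold` errors fall within `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self._window = window_seconds
        self._threshold = threshold
        self._events = deque()  # timestamps of recent errors, oldest first

    def record_error(self, timestamp):
        """Register one error and return True if the alert should fire."""
        self._events.append(timestamp)
        # Drop errors that have aged out of the window.
        while self._events and self._events[0] <= timestamp - self._window:
            self._events.popleft()
        return len(self._events) >= self._threshold
```

With a 60-second window and a threshold of 3, three errors at seconds 0, 10, and 20 trigger the alert, while a lone error at second 100 does not, because the earlier ones have expired.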

C. ELK Advantages

Powerful search (Elasticsearch) and rich visualization (Kibana) enable fast data retrieval and detailed dashboards.

D. ELK Disadvantages

Three separate systems lack unified deployment/management tools.

Complex permission management for multi‑tenant scenarios.

Security vulnerabilities (e.g., past Elasticsearch exploits).

Limited data mining capabilities without deep development.

2. EFK

EFK replaces Logstash with Fluentd, which also supports Elasticsearch as a destination, offering an alternative log collection solution.

3. Logstash vs Fluentd Comparison

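Both have large plugin ecosystems and are actively maintained. Logstash runs on the JVM and offers strong parallelism and built‑in Grok parsing; Fluentd lacked Windows support in older releases (it was added in v0.14). Both have lightweight companion shippers (Beats for Logstash, Fluent Bit for Fluentd) for sending logs to mature back‑ends.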
