Log Collection and Processing Architecture with Flume and Kafka for Big Data Platforms
This article explains how to design a scalable log collection system for big‑data platforms by combining Flume for data ingestion, Kafka for buffering and high‑throughput transport, and downstream processing components, providing configuration examples and best‑practice recommendations.
Big data platforms generate massive volumes of logs every day, so they need a dedicated log system that bridges data‑producing applications and analytical systems while keeping the two loosely coupled.
Such a system should support near‑real‑time online analysis as well as offline batch processing (e.g., Hadoop), and must be highly scalable through horizontal node expansion.
To meet these needs, the log collection architecture is divided into four modules:
Data collection – real‑time ingestion from nodes (recommended tool: Flume‑NG).
Data buffering – decouples ingestion speed from processing speed (recommended tool: Kafka).
Stream processing – real‑time analysis of collected data (recommended tool: Storm).
Data output – persisting results to storage such as HDFS or MySQL.
Flume is a distributed, reliable, highly available log collector. Its core components are the Source, the Channel, and the Sink, which together form an Agent. Agents can be chained to build multi‑tier flows, and Flume also supports multiplexing and load balancing across sinks.
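As a rough illustration of chaining and load balancing, the fragment below sketches a two‑tier flow: an edge agent spreads events across two collector agents through Avro sinks grouped under a load‑balancing sink processor. The agent names (edge, collector), the channel name (ch1), the host names, and port 4545 are hypothetical placeholders, and the edge agent's source and channel definitions are omitted for brevity.

```properties
# Edge agent (hypothetical name "edge"): two Avro sinks in one load-balanced group
edge.sinks = avro1 avro2
edge.sinkgroups = lbgroup
edge.sinkgroups.lbgroup.sinks = avro1 avro2
edge.sinkgroups.lbgroup.processor.type = load_balance
edge.sinkgroups.lbgroup.processor.selector = round_robin
# Temporarily skip a collector that fails instead of retrying it immediately
edge.sinkgroups.lbgroup.processor.backoff = true

edge.sinks.avro1.type = avro
edge.sinks.avro1.hostname = collector1.example.com
edge.sinks.avro1.port = 4545
edge.sinks.avro1.channel = ch1

edge.sinks.avro2.type = avro
edge.sinks.avro2.hostname = collector2.example.com
edge.sinks.avro2.port = 4545
edge.sinks.avro2.channel = ch1

# Collector agent (hypothetical name "collector"): the Avro source receives events from edge agents
collector.sources = avrosrc
collector.sources.avrosrc.type = avro
collector.sources.avrosrc.bind = 0.0.0.0
collector.sources.avrosrc.port = 4545
collector.sources.avrosrc.channels = ch1
```

Both sinks drain the same channel, and the sink processor decides which one handles each batch. Adding another collector then only requires one more Avro sink in the group.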
Kafka is a distributed publish‑subscribe messaging system that stores messages on disk with O(1) access cost, offers high throughput (≈250 k messages/s), and scales horizontally without downtime.
While both Flume and Kafka can handle data collection, Flume excels at configuration‑driven ingestion without programming, whereas Kafka provides higher throughput and stronger reliability when custom producers/consumers are implemented.
Recommendation: Use Flume to collect the data and its Kafka Sink to publish the collected events to Kafka, combining Flume’s ease of configuration with Kafka’s performance and reliability. For stricter reliability, use Kafka as Flume’s Channel, so that events are persisted in a Kafka topic rather than held in memory.
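The Kafka‑as‑Channel option might look like the minimal sketch below. It assumes the Flume 1.6‑style Kafka Channel property names (brokerList, zookeeperConnect, topic, parseAsFlumeEvent); later Flume releases rename several of them, so check the user guide for the version in use. The channel name kafkacnl is a hypothetical placeholder, and the hosts, log path, and topic are borrowed from the integration example in this article.

```properties
# Hypothetical variant: the Kafka topic itself acts as the channel, so no sink is configured
agent1.sources = logsrc
agent1.channels = kafkacnl

agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /data1/logs/component_role.log
agent1.sources.logsrc.channels = kafkacnl

agent1.channels.kafkacnl.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kafkacnl.brokerList = zdh100:9092,zdh101:9092
agent1.channels.kafkacnl.zookeeperConnect = zdh100:2181
agent1.channels.kafkacnl.topic = mytopic
# Store plain log lines instead of Avro-wrapped Flume events so ordinary Kafka consumers can read them
agent1.channels.kafkacnl.parseAsFlumeEvent = false
```

With this layout, events accepted by the channel survive an agent restart because they already live in Kafka. The exec source itself still offers no delivery guarantee, so lines read by tail but not yet handed to the channel can be lost.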
Flume‑Kafka Integration Example
Assume Flume reads /data1/logs/component_role.log and forwards it to Kafka topic mytopic. The following Flume agent configuration demonstrates this setup:
agent1.sources = logsrc
agent1.channels = memcnl
agent1.sinks = kafkasink
# source section
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /data1/logs/component_role.log
agent1.sources.logsrc.shell = /bin/sh -c
agent1.sources.logsrc.batchSize = 50
agent1.sources.logsrc.channels = memcnl
# sink definition
agent1.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafkasink.brokerList = zdh100:9092, zdh101:9092, zdh102:9093
agent1.sinks.kafkasink.topic = mytopic
agent1.sinks.kafkasink.requiredAcks = 1
agent1.sinks.kafkasink.batchSize = 20
agent1.sinks.kafkasink.channel = memcnl
# channel definition
agent1.channels.memcnl.type = memory
agent1.channels.memcnl.capacity = 1000
Start the Flume agent:
/home/mr/flume/bin/flume-ng agent -c /home/mr/flume/conf -f /home/mr/flume/conf/flume-conf.properties -n agent1 -Dflume.monitoring.type=http -Dflume.monitoring.port=10100
Append test logs to the source file:
echo "test code" >> /data1/logs/component_role.log
echo "Check that the Flume+Kafka data pipeline is working" >> /data1/logs/component_role.log
Verify that Kafka received the messages:
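/home/mr/kafka/bin/kafka-console-consumer.sh --zookeeper zdh100:2181 --topic mytopic --from-beginning
The consumer output should display the newly appended log lines, confirming that the end‑to‑end pipeline works. Newer Kafka releases have dropped the --zookeeper option from the console consumer; there, pass --bootstrap-server with a broker address (for example zdh100:9092) instead.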
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
