Apache Flume Quickstart: Log Collection and Kafka Integration
This article introduces Apache Flume, explains its design goals of reliability, scalability, manageability, and extensibility, outlines its core concepts and architecture, walks through a basic source → channel → sink configuration, integrates it with ZooKeeper, Kafka, and a shell script, and shows how to launch and verify the agent.
Overview
Apache Flume is a distributed, reliable, and highly‑available system for collecting and aggregating large volumes of log data. It lets users define custom data sources, apply simple processing, and deliver events to configurable destinations (sinks). These components run inside a process called an agent.
Design Goals
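Reliability : Aims to prevent data loss when a node fails. Three reliability levels are supported: end‑to‑end (the event is written to disk before it is sent and removed only after acknowledgement), store‑on‑failure (events are written locally when the receiver is down and resent once it recovers), and best‑effort (events are sent without acknowledgement, so they can be lost).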
Scalability : Three‑tier architecture (agent → collector → storage) can be horizontally scaled. Multiple masters are coordinated via ZooKeeper to avoid a single point of failure.
Manageability : Centralized master management, web UI and shell commands for monitoring and configuring data flows.
Extensibility : Users can add custom sources, channels, or sinks. Built‑in components include file, syslog, HDFS, Kafka, etc.
Note: the master nodes, collector tier, and three reliability levels above describe the original Flume 0.9.x design (often called Flume OG). Flume 1.x (Flume NG), which the rest of this article uses, has no master; delivery guarantees come from transactional channels, with the file channel persisting events to disk.
Core Concepts
Data flows through three logical components:
Source : Generates events (e.g., reads a file, executes a command).
Channel : Queues events; can be memory‑based or file‑based.
Sink : Consumes events and writes them to a destination such as HDFS, Kafka, or a database.
A source may write to one or more channels, while a sink reads from a single channel. An agent can contain multiple sources, channels, and sinks.
Architecture Diagram
Installation
Download Flume 1.6.0 from http://flume.apache.org/ , extract the archive, and set JAVA_HOME if not already defined.
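After extracting the archive, a quick pre-flight check can confirm the two things this step depends on. The sketch below is only an illustration: check_flume_env is a hypothetical helper (not part of Flume), and it assumes the extracted directory contains the bin/flume-ng launcher used later in this article.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with Flume): verify that JAVA_HOME is set
# and that the flume-ng launcher exists under the extracted directory.
check_flume_env() {
  local flume_home="$1"
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$flume_home/bin/flume-ng" ]; then
    echo "flume-ng not found or not executable under $flume_home/bin" >&2
    return 1
  fi
  echo "ok"
}
```

Calling check_flume_env with the path of the extracted directory prints ok when both conditions hold and returns a non-zero status otherwise.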
Configuration Example (Pattern 1 – Source → Channel → Sink)
The following conf/hw.conf defines a simple pipeline that reads a log file via an exec source, buffers events in a memory channel, and writes them to a Kafka topic.
agent.sources = r1
agent.channels = c1
agent.sinks = k1
# Source configuration
agent.sources.r1.type = exec
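# The exec source turns the command's stdout into events, so tail the log file
# that output.sh appends to (output.sh itself prints nothing to stdout).
agent.sources.r1.command = tail -F /path/to/abc.log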
agent.sources.r1.channels = c1
# Channel configuration
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 1000
# Sink configuration (Kafka)
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = test_topic
agent.sinks.k1.brokerList = localhost:9092
agent.sinks.k1.channel = c1
Running the Agent
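Before launching, it can help to confirm that every source and sink points at a channel that is actually declared, since a mistyped channel name is an easy mistake in a properties file. The following is a minimal sketch under stated assumptions: check_channels is a hypothetical helper, it only understands simple key = value lines, and it assumes the agent name is used as the property prefix (agent in this article).

```shell
#!/bin/bash
# Hypothetical helper (not part of Flume): list sources/sinks that reference
# a channel missing from the "<agent>.channels = ..." declaration.
# Returns 0 when all references are declared, 1 otherwise.
check_channels() {
  local conf="$1" agent="${2:-agent}"
  awk -v a="$agent" '
    /^[ \t]*#/ || !/=/ { next }          # skip comments and non-property lines
    {
      key = $0; sub(/[ \t]*=.*$/, "", key); sub(/^[ \t]+/, "", key)
      val = $0; sub(/^[^=]*=[ \t]*/, "", val); sub(/[ \t]+$/, "", val)
      if (key == a ".channels") {
        n = split(val, d, /[ \t]+/)
        for (i = 1; i <= n; i++) declared[d[i]] = 1
      } else if (key ~ ("^" a "\\.(sources|sinks)\\.[^.]+\\.channels?$")) {
        refs[key] = val
      }
    }
    END {
      bad = 0
      for (k in refs) {
        n = split(refs[k], r, /[ \t]+/)
        for (i = 1; i <= n; i++)
          if (!(r[i] in declared)) { print k ": undeclared channel " r[i]; bad = 1 }
      }
      exit bad
    }' "$conf"
}
```

For the configuration in this article, check_channels conf/hw.conf agent should print nothing and return 0. Flume performs its own validation at startup, so this is only a convenience check.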
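./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console
The value passed to -n must match the agent name used as the property prefix in hw.conf (agent in this example). If the console output contains a line such as Component type: SINK, name: k1 started, the sink has started and the agent is running.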
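When the agent runs unattended, the same check can be scripted against a captured log instead of read off the console. The sketch below assumes the console output has been redirected to a file; sink_started and the file name flume.out are hypothetical, and the pattern tolerates the line being printed with or without spaces after the punctuation.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a captured Flume log reports that the named
# sink has started. Spaces after ":" and "," are treated as optional.
sink_started() {
  local log="$1" sink="${2:-k1}"
  grep -Eq "Component type: ?SINK, ?name: ?${sink} started" "$log"
}
```

For example, after launching with the output redirected to flume.out, running sink_started flume.out k1 returns 0 once the startup line has been logged.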
Prerequisite Services
Both services should be up before the agent is launched.
Start ZooKeeper:
./zkServer.sh start
Then start Kafka and make sure the broker is listening on the address given in the sink's brokerList (localhost:9092 in this example).
Generating Test Data
Create an executable shell script output.sh that continuously appends the string test to a log file:
#!/bin/bash
while true; do echo "test" >> /path/to/abc.log; sleep 1; done
Make it executable (chmod +x output.sh) and run it with ./output.sh. The exec source will read the file and emit each line as an event.
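For a repeatable test, a bounded variant that writes a known number of numbered lines makes it easier to tell whether anything was dropped. This is a sketch, not the article's script: emit_lines, its arguments, and the numbered "test N" payload are assumptions introduced for illustration.

```shell
#!/bin/bash
# Hypothetical bounded variant of output.sh: append COUNT numbered lines to
# LOG, pausing DELAY seconds between lines, then stop.
emit_lines() {
  local log="$1" count="${2:-5}" delay="${3:-0}"
  local i
  for ((i = 1; i <= count; i++)); do
    echo "test $i" >> "$log"
    sleep "$delay"
  done
}
```

Running emit_lines /path/to/abc.log 10 1 appends ten lines one second apart, so ten messages would be expected downstream if the pipeline is healthy.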
Verification
Check Flume's log files for errors, then use a Kafka consumer to confirm that the test messages are arriving in the target topic, for example:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test_topic --from-beginning
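The consumer check can also be turned into a count rather than an eyeball test. The helper below is a hypothetical sketch that counts matching lines on standard input; the usage shown in the comment assumes a Kafka release whose console consumer accepts --max-messages and --timeout-ms, which may not hold for older versions.

```shell
#!/bin/bash
# Hypothetical helper: count stdin lines that contain PATTERN (default "test").
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status 0.
count_matching() {
  local pattern="${1:-test}"
  grep -c -- "$pattern" || true
}

# Possible usage (requires a running broker; flags depend on the Kafka version):
#   kafka-console-consumer.sh --bootstrap-server localhost:9092 \
#     --topic test_topic --from-beginning --max-messages 10 --timeout-ms 15000 \
#     | count_matching test
```

Comparing the printed count with the number of lines written to the log gives a rough end-to-end check. With a memory channel, events buffered at the moment of an agent crash can be lost, so a small shortfall after a restart is not necessarily a configuration error.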
These steps provide a minimal, functional Flume deployment that can be extended with additional sources, channels, sinks, or custom processing logic.
dbaplus Community
