
Apache Flume Quickstart: Log Collection and Kafka Integration

This article introduces Apache Flume, explains its design goals of reliability, scalability, manageability and extensibility, outlines its core concepts and architecture, walks through a step‑by‑step configuration of the basic source → channel → sink pattern, demonstrates integration with ZooKeeper, Kafka and a shell script, and shows how to launch and verify the agent.

dbaplus Community

Overview

Apache Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data. Users can plug in custom data sources, apply simple processing to events in transit, and deliver them to configurable destinations (sinks).

Design Goals

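Reliability: Events should survive the failure of a node. Three reliability levels are supported: end‑to‑end (the agent writes each event to disk before sending it and removes it only after the receiver acknowledges it), store‑on‑failure (events are written locally when the receiver is down and resent once it recovers), and best‑effort (events are sent without acknowledgement, so they can be lost).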

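Scalability: Each tier of the three‑tier architecture (agent → collector → storage) can be scaled horizontally. Multiple masters are coordinated via ZooKeeper to avoid a single point of failure. Note that the master and collector roles come from the original Flume 0.9.x design; in Flume 1.x, the version installed below, each agent runs standalone from its own configuration file.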

Manageability: All agents are managed centrally by the master, and data flows can be monitored and configured through a web UI and shell commands.

Extensibility: Users can add custom sources, channels, or sinks. Built‑in components include file, syslog, HDFS, and Kafka, among others.

Core Concepts

Data flows through three logical components:

Source: Generates events (e.g., reads a file, executes a command).

Channel: Queues events; can be memory‑based or file‑based.

Sink: Consumes events and writes them to a destination such as HDFS, Kafka, or a database.

A source may write to one or more channels, while a sink reads from a single channel. An agent can contain multiple sources, channels, and sinks.
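The difference between the plural and singular wiring keys is easiest to see in a configuration. The following is a minimal sketch, not part of the original article: the output path, the component names, and the use of the built-in logger sink are illustrative assumptions. It writes a fan-out configuration in which one source feeds two channels and each sink drains exactly one of them.

```shell
#!/bin/sh
# Illustrative only: the file name and component names are hypothetical.
conf_file="${TMPDIR:-/tmp}/fanout-example.conf"

cat > "$conf_file" <<'EOF'
agent.sources = r1
agent.channels = c1 c2
agent.sinks = k1 k2

# One source, two channels: note the plural "channels" key
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /path/to/abc.log
agent.sources.r1.channels = c1 c2

agent.channels.c1.type = memory
agent.channels.c2.type = memory

# Each sink reads from exactly one channel: singular "channel" key
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1
agent.sinks.k2.type = logger
agent.sinks.k2.channel = c2
EOF

# Print only the lines that wire sources and sinks to channels
grep -E '^agent\.(sources|sinks)\.[^.]+\.channels? = ' "$conf_file"
```

With Flume's default replicating channel selector, every event from r1 is copied into both c1 and c2, so both sinks see the full stream.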

Architecture Diagram

Installation

Download Flume 1.6.0 from http://flume.apache.org/, extract the archive, and set JAVA_HOME if it is not already defined.
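Before going further, it can help to confirm that the Java environment is usable. The check below is a small sketch, not something shipped with Flume: the function name check_java_home is made up for this example, and it only verifies that JAVA_HOME points at a directory containing an executable bin/java.

```shell
#!/bin/sh
# Hypothetical helper: verify JAVA_HOME before starting Flume.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable found at $JAVA_HOME/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Typical use from the extracted Flume directory (commented out here
# because it needs a real installation):
# check_java_home && ./bin/flume-ng version
```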

Configuration Example (Pattern 1 – Source → Channel → Sink)

The following conf/hw.conf defines a simple pipeline that reads a log file via an exec source, buffers events in a memory channel, and writes them to a Kafka topic.

agent.sources = r1
agent.channels = c1
agent.sinks = k1

# Source configuration
agent.sources.r1.type = exec
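# The exec source turns each line its command prints to stdout into an event,
# so it follows the log file that output.sh appends to
agent.sources.r1.command = tail -F /path/to/abc.log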
agent.sources.r1.channels = c1

# Channel configuration
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 1000

# Sink configuration (Kafka)
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = test_topic
agent.sinks.k1.brokerList = localhost:9092
agent.sinks.k1.channel = c1
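A common cause of a silent pipeline is a channel name that is referenced but never declared. The following is a rough sketch of a pre-flight check, not a Flume feature: the function name check_channel_wiring is invented here, and it assumes the agent is named agent, as in hw.conf.

```shell
#!/bin/sh
# Hypothetical pre-flight check; assumes the agent name "agent".
# Prints each channel that a source or sink references but that is missing
# from the "agent.channels" declaration, and returns non-zero if any exist.
check_channel_wiring() {
    awk -F'[ \t]*=[ \t]*' '
        /^[ \t]*#/ || NF < 2 { next }    # skip comments and lines without "="
        $1 == "agent.channels" {         # collect declared channel names
            n = split($2, names, /[ \t]+/)
            for (i = 1; i <= n; i++) if (names[i] != "") declared[names[i]] = 1
            next
        }
        $1 ~ /^agent\.(sources|sinks)\.[^.]+\.channels?$/ {   # collect references
            n = split($2, names, /[ \t]+/)
            for (i = 1; i <= n; i++) if (names[i] != "") used[names[i]] = $1
        }
        END {
            for (c in used) if (!(c in declared)) {
                printf "undeclared channel %s (referenced by %s)\n", c, used[c]
                bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}
```

It could be run as check_channel_wiring conf/hw.conf before starting the agent; no output and a zero exit status mean every referenced channel is declared.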

Running the Agent

./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

If the console output ends with a line containing Component type: SINK, name: k1 started, the sink has started and the agent is running.
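When the agent is started in the background, the same check can be scripted. This is a sketch under assumptions: the helper name wait_for_log_line and the log path flume.out are invented for illustration, and the pattern tolerates optional spaces because the exact spacing of the startup message may differ between versions.

```shell
#!/bin/sh
# Hypothetical helper: poll a log file until a line matching an extended
# regular expression appears, or give up after a timeout in seconds.
wait_for_log_line() {
    log_file=$1
    pattern=$2
    timeout=${3:-30}
    elapsed=0
    while :; do
        if [ -f "$log_file" ] && grep -E -q -- "$pattern" "$log_file"; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example use (commented out because it needs a Flume installation):
# ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf \
#     -Dflume.root.logger=INFO,console > flume.out 2>&1 &
# wait_for_log_line flume.out 'type: *SINK, *name: *k1 started' 60 \
#     && echo "sink k1 is up"
```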

Prerequisite Services

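Start both services before launching the agent.

Start ZooKeeper: ./zkServer.sh start

Start Kafka (for example, bin/kafka-server-start.sh config/server.properties from the Kafka installation directory) and make sure the broker is listening on the address given in the sink configuration (localhost:9092 here).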
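Whether the services are actually reachable can be checked before Flume is launched. The sketch below makes several assumptions: the helper name port_open is invented, it relies on bash's /dev/tcp redirection (a bash feature, not a real device file), and the ports are ZooKeeper's default client port 2181 and the broker port 9092 from the sink configuration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
# Uses bash's built-in /dev/tcp redirection, so bash must be installed.
port_open() {
    bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report on the two services this article depends on.
for service in "ZooKeeper:localhost:2181" "Kafka:localhost:9092"; do
    name=${service%%:*}
    rest=${service#*:}
    host=${rest%%:*}
    port=${rest#*:}
    if port_open "$host" "$port"; then
        echo "$name is listening on $host:$port"
    else
        echo "$name is NOT reachable on $host:$port"
    fi
done
```

An open port only shows that something is listening there; it does not prove that the process is a healthy broker.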

Generating Test Data

Create an executable shell script output.sh that continuously appends the string test to a log file:

#!/bin/bash
while true; do echo "test" >> /path/to/abc.log; sleep 1; done

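Make it executable (chmod +x output.sh) and run it with ./output.sh. An exec source emits each line that its command writes to standard output as one event, so the source command has to follow this file (for example, tail -F /path/to/abc.log) for the appended lines to reach Flume.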
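For a bounded test run, a numbered variant of the generator can be handy, since sequence numbers make gaps or reordering visible on the consumer side. This is an illustrative sketch; the function name generate_lines and its arguments are not part of the original script.

```shell
#!/bin/sh
# Hypothetical variant of output.sh: append COUNT numbered lines to LOG_FILE,
# pausing DELAY seconds after each line (default 0).
generate_lines() {
    log_file=$1
    count=$2
    delay=${3:-0}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "test $i" >> "$log_file"
        sleep "$delay"
        i=$((i + 1))
    done
}

# Example: generate_lines /path/to/abc.log 100 1
```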

Verification

Check Flume’s log files for errors, then use a Kafka console consumer to confirm that the test messages are arriving in the target topic:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test_topic --from-beginning
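For a scripted check, the consumer output can be counted instead of read by eye. The following is a sketch: count_matching is a made-up helper, and the --timeout-ms option (which makes the console consumer exit after a period with no new messages) is assumed to be available in the installed Kafka version.

```shell
#!/bin/sh
# Hypothetical helper: count lines on standard input that are exactly equal
# to the given string. Always exits 0 so that a count of 0 is not an error.
count_matching() {
    grep -c -x -F -- "$1" || true
}

# Example use (commented out because it needs a running broker):
# kafka-console-consumer.sh --bootstrap-server localhost:9092 \
#     --topic test_topic --from-beginning --timeout-ms 10000 \
#     | count_matching test
```

A count that keeps growing between runs suggests that events are flowing from the log file through Flume into the topic.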

These steps provide a minimal, functional Flume deployment that can be extended with additional sources, channels, sinks, or custom processing logic.
