
Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide

This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.


Flume NG is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving large volumes of log data from diverse sources to centralized storage. Compared with the original Flume OG, the NG version is a lighter-weight tool that adds support for failover and load balancing.

Key Architectural Concepts

Event: the basic unit of data, consisting of a byte payload and an optional set of headers.

Flow: an abstract description of an Event's movement from its point of origin to its final destination.

Client: runs at the data origin and sends Events to a Flume Agent.

Agent: an independent Flume process (JVM) containing Sources, Channels, and Sinks.

Source: consumes Events delivered to it by an external system or another Agent.

Channel: a transient store that buffers Events received from a Source until a Sink consumes them.

Sink: reads Events from a Channel and forwards them to the next Agent or to final storage (e.g., HDFS).

The typical data flow is: external system → Source → Channel → Sink → storage (e.g., HDFS).
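As a minimal illustration of this flow, a single-agent pipeline can be wired together in a properties file. This sketch uses the built-in Netcat Source, Memory Channel, and Logger Sink; the component names (`a1`, `r1`, `c1`, `k1`) are arbitrary labels, not required values:

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat Source: listen on localhost:44444, one Event per line of input
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory Channel buffering up to 1000 Events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Logger Sink: write Events to the agent's log
a1.sinks.k1.type = logger

# Bind the Source and Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Every pipeline in this article follows this same shape: declare the components, configure each, then bind Source and Sink to a Channel.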

Typical Flow Configurations

Multiple agents connected sequentially.

Multiple agents aggregating into a single downstream agent.

Multiplexing agents using a selector for replication or routing based on header values.

Load‑balancing Sink Processor that distributes Events from a Channel to several Sinks.

Failover Sink Processor that maintains a priority list of Sinks and switches when a Sink becomes unavailable.
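The last two patterns are configured through sink groups. As a sketch, assuming an agent `a1` with two Avro Sinks `k1` and `k2` already defined, a load-balancing group looks like:

```properties
# Group sinks k1 and k2 and distribute Events across them
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Load balancing: round_robin (or random) selection across available sinks
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.backoff = true
```

The failover variant instead assigns priorities, routing all Events to the highest-priority available Sink and falling back when it fails:

```properties
# Failover: k2 takes over when the higher-priority k1 becomes unavailable
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```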

Basic Functionalities

Flume NG supports a wide range of Source, Channel, and Sink types.

Source Types

| Source Type | Description |
| --- | --- |
| Avro Source | Built-in support for Avro RPC. |
| Thrift Source | Built-in support for the Thrift protocol. |
| Exec Source | Executes a Unix command and reads its standard output as Events. |
| JMS Source | Reads messages from a JMS broker (e.g., ActiveMQ). |
| Spooling Directory Source | Monitors a directory for new files. |
| Twitter 1% Firehose Source | Streams a sample of Twitter data via the streaming API. |
| Netcat Source | Listens on a port and treats each line as an Event. |
| Sequence Generator Source | Generates sequential data. |
| Syslog Source | Consumes syslog data over UDP/TCP. |
| HTTP Source | Accepts HTTP POST/GET requests (JSON, BLOB). |
| Legacy Sources | Compatibility with Flume OG sources. |

Channel Types

| Channel Type | Description |
| --- | --- |
| Memory Channel | Stores Events in memory. |
| JDBC Channel | Persists Events in a relational database (embedded Derby supported). |
| File Channel | Persists Events to files on disk. |
| Spillable Memory Channel | Hybrid memory/disk storage; experimental. |
| Pseudo Transaction Channel | Used for testing only. |
| Custom Channel | User-defined implementation. |

Sink Types

| Sink Type | Description |
| --- | --- |
| HDFS Sink | Writes data to HDFS. |
| Logger Sink | Writes data to log files. |
| Avro Sink | Converts Events to Avro and sends them via RPC. |
| Thrift Sink | Converts Events to Thrift and sends them via RPC. |
| IRC Sink | Relays data to IRC channels. |
| File Roll Sink | Writes data to the local file system with file rolling. |
| Null Sink | Discards all data. |
| HBase Sink | Writes data to HBase. |
| Morphline Solr Sink | Sends data to Solr clusters. |
| ElasticSearch Sink | Sends data to Elasticsearch clusters. |
| Kite Dataset Sink | Writes data to a Kite Dataset (experimental). |
| Custom Sink | User-defined implementation. |

Additional components such as Channel Selectors, Sink Processors, Event Serializers, and Interceptors are also available.
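Interceptors, for example, decorate Events with headers before they reach the Channel. A brief sketch using the built-in timestamp and host interceptors, assuming an agent `a1` with a source `r1` (both names are placeholders):

```properties
# Attach two interceptors, applied in order, to source r1
a1.sources.r1.interceptors = i1 i2

# Timestamp Interceptor: adds a "timestamp" header (epoch milliseconds)
a1.sources.r1.interceptors.i1.type = timestamp

# Host Interceptor: adds a "host" header with the agent's hostname/IP
a1.sources.r1.interceptors.i2.type = host
```

Headers added this way can later drive multiplexing Channel Selectors or escape sequences in HDFS Sink paths.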

Practical Application

Installation is straightforward; the examples below use Flume NG version 1.5.0.1. The demonstrated configurations (all using a Memory Channel for simplicity) include:

Avro Source + Memory Channel + Logger Sink

Avro Source + Memory Channel + HDFS Sink

Spooling Directory Source + Memory Channel + HDFS Sink

Exec Source + Memory Channel + File Roll Sink

Each example shows how to edit the corresponding flume‑conf*.properties file, start the Agent, and send data using an Avro client or command‑line tool. The results are verified by checking logs, HDFS directories, or local file system paths.
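As one illustrative sketch of such a pipeline, the Spooling Directory Source + Memory Channel + HDFS Sink combination might be configured as follows; the spool directory, NameNode URL, and agent name `a1` are placeholder assumptions to adapt to your environment:

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Spooling Directory Source: ingest files dropped into the spool directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume-spool
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# HDFS Sink: write Events as a plain stream under a date-partitioned path
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

The Agent is then started with `bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console`; files copied into the spool directory should subsequently appear under the configured HDFS path.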

The article concludes by encouraging readers to consult the official Flume user manual for more detailed configuration options.

Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
