Mastering Apache Flume: Architecture, Components, and Key Features

This article provides a comprehensive overview of Apache Flume, detailing its purpose as a distributed log aggregation system, explaining its core components such as sources, channels, and sinks, and illustrating its architecture, multi‑agent setups, and key features like reliability, scalability, compression, and monitoring.

Programmer DD

Mar 28, 2021

Flume Introduction

Apache Flume is an open-source, distributed, reliable, and highly available system for collecting and aggregating large volumes of log data. It can be customized to collect data from a variety of sources and write it to configurable destinations.

Flume Overview

Flume provides simple data processing capabilities and can write data to a variety of configurable destinations (sinks).

What is Flume?

Flume is a streaming log collection tool. It can gather data from local files (spooling directory source), logs as they are being written (taildir and exec sources), HTTP/REST messages, Thrift, Avro, Syslog, Kafka, and other sources.

What can Flume do?

Collect logs from fixed directories to destinations such as HDFS, HBase, or Kafka.

Collect logs in real time (taildir source) and deliver them to the same kinds of destinations.

Support cascading of multiple Flume agents for data merging.

Allow user‑defined data collection configurations.

Flume's Position in FusionInsight

Flume is a distributed framework for collecting and aggregating event streams.

Flume System Architecture

Flume Basic Architecture

In the basic architecture, a single Flume agent collects data on one node. This setup is used mainly for data generated inside the cluster.
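A minimal sketch of such a single-node agent, written in Flume's properties configuration format. The agent name `a1`, the component names `r1`, `c1`, and `k1`, and all file paths are assumptions chosen for illustration; the logger sink is used only so that the example needs no external system.

```properties
# Declare the components of the (hypothetical) agent "a1".
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow files that match a pattern (paths are assumptions).
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/lib/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*log.*
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: write events to the agent's own log, for demonstration only.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Assuming the file is saved as `example.conf`, an agent like this is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.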

Flume Multi‑Agent Architecture

Multiple Flume nodes can be linked together, allowing data from external sources to be collected and stored inside the cluster.
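One common way to link agents is to point an Avro sink on the upstream agent at an Avro source on the downstream agent. The sketch below assumes two hypothetical agents, `edge` (outside the cluster) and `collector` (inside it); the host names, port, and paths are placeholders, not values from the original article.

```properties
# --- Agent "edge": runs on the external node ---
edge.sources = r1
edge.channels = c1
edge.sinks = k1

edge.sources.r1.type = spooldir
edge.sources.r1.spoolDir = /var/log/incoming
edge.sources.r1.channels = c1

edge.channels.c1.type = memory

# The Avro sink forwards events to the next hop.
edge.sinks.k1.type = avro
edge.sinks.k1.hostname = collector.example.com
edge.sinks.k1.port = 4545
edge.sinks.k1.channel = c1

# --- Agent "collector": runs inside the cluster ---
collector.sources = r1
collector.channels = c1
collector.sinks = k1

# The Avro source listens for events sent by upstream agents.
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1

collector.channels.c1.type = memory

collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode.example.com:8020/flume/events/%Y-%m-%d
collector.sinks.k1.hdfs.useLocalTimeStamp = true
collector.sinks.k1.channel = c1
```

Each agent is started separately with its own `--name`, so both definitions can live in one file or in two.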

Flume Architecture

Components include:

Event – the basic unit of data transferred by Flume.

Source – receives events, or generates them, and pushes them to one or more channels.

Interceptor – filters and modifies events according to user configuration.

Channel Selector – routes events to different channels based on configuration.

Channel – temporary buffer for events.

Sink Runner – drives the Sink Processor.

Sink Processor – implements strategies such as load balancing, failover, and pass‑through.

Sink – writes events from a channel to a destination.
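To make the channel selector concrete, here is a hedged configuration fragment that routes events to different channels by header value. It assumes an agent `a1` whose source `r1` and channels `c1` and `c2` are defined elsewhere, and a hypothetical event header named `logType` that some upstream component (an interceptor or the sending client) has already set.

```properties
# The source feeds two channels.
a1.sources.r1.channels = c1 c2

# Multiplexing selector: choose the channel from an event header.
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logType
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.access = c2

# Events with any other (or no) logType value go here.
a1.sources.r1.selector.default = c2
```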

Basic Concept – Source

Sources receive events or generate them via special mechanisms and batch‑push them to one or more channels. Sources can be driver‑type (external systems push data) or polling‑type (Flume pulls data).

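Driver-type (event-driven) source – an external system actively sends data to Flume, as with the Avro and Syslog sources.

Polling-type source – Flume periodically pulls data itself, as with the taildir and Kafka sources.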

A source must be associated with at least one channel.

Basic Concept – Channel

Channels sit between sources and sinks, acting as queues that temporarily store incoming events until sinks successfully transfer them onward.

Different channel types provide varying durability:

Memory Channel – stores events in memory; it offers high throughput but no persistence, so buffered events are lost if the agent process stops.

File Channel – persists events to disk using a write-ahead log; it requires data and checkpoint directories to be configured.

JDBC Channel – persists events in an embedded database (Derby); it can be used as an alternative to the file channel.

Channels support transactions and can connect any number of sources and sinks.
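The trade-off between the two most common channel types shows up directly in their configuration. The fragment below is a sketch for a hypothetical agent `a1`; the capacities and directory paths are illustrative assumptions, not recommendations.

```properties
a1.channels = c1 c2

# Memory channel: fast, but events are lost if the agent stops.
a1.channels.c1.type = memory
# Maximum number of events held in the channel.
a1.channels.c1.capacity = 10000
# Maximum number of events per source or sink transaction.
a1.channels.c1.transactionCapacity = 1000

# File channel: events survive a restart.
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c2.dataDirs = /var/lib/flume/data
```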

Basic Concept – Sink

Sinks transfer events to the next hop or final destination and remove them from the channel upon success.

Each sink reads from exactly one channel.

Sink implementations exist for HDFS, HBase, Kafka, and other destinations.
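As an illustration, the fragment below sketches an HDFS sink and a Kafka sink for a hypothetical agent `a1`. The channels `c1` and `c2` are assumed to be defined elsewhere, and the host names, topic, and roll interval are placeholders. Note that each sink names a single channel through the singular `channel` key.

```properties
a1.sinks = k1 k2

# HDFS sink: writes plain event bodies into date-partitioned directories.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode.example.com:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 300 seconds.
a1.sinks.k1.hdfs.rollInterval = 300
# Resolve %Y-%m-%d from the agent's clock instead of a timestamp header.
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Kafka sink: publishes events to a topic.
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.channel = c2
a1.sinks.k2.kafka.bootstrap.servers = broker1.example.com:9092
a1.sinks.k2.kafka.topic = app-logs
```

The two sinks read from separate channels on purpose: sinks that share one channel split its events between them rather than each receiving a copy.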

Key Features of Flume

Flume Supports Log File Collection

Flume can collect log files from outside the cluster and archive them to HDFS, HBase, or Kafka for downstream analysis and cleaning.

Flume Supports Multi‑Level Cascading and Replication

Multiple Flume agents can be chained together, and each node can replicate data, enabling collection from external nodes and aggregation within the cluster.
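Replication within one agent can be sketched with the replicating channel selector, which copies every event to each channel attached to the source. The fragment assumes a hypothetical agent `a1` whose source `r1` and channels `c1` and `c2` are defined elsewhere.

```properties
# Every event from r1 is written to both channels.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Optional: a failed write to c2 does not fail the whole transaction.
a1.sources.r1.selector.optional = c2
```

Each channel can then be drained by its own sink, for example one that writes to HDFS and one that forwards to another agent.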

Flume Cascading Compression and Encryption

Data transmission between cascading nodes can be compressed and encrypted to improve efficiency and security.
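In open-source Flume, one way to get both is through the Avro sink and Avro source that connect two agents. The sketch below assumes hypothetical agents `edge` and `collector`; the host name, port, store paths, and passwords are placeholders. Only the hop-related settings are shown, so sources, channels, and the remaining sinks would still have to be defined.

```properties
# Upstream agent: Avro sink with compression and TLS.
edge.sinks.k1.type = avro
edge.sinks.k1.hostname = collector.example.com
edge.sinks.k1.port = 4545
edge.sinks.k1.channel = c1
edge.sinks.k1.compression-type = deflate
edge.sinks.k1.compression-level = 6
edge.sinks.k1.ssl = true
edge.sinks.k1.truststore = /etc/flume/truststore.jks
edge.sinks.k1.truststore-password = changeit

# Downstream agent: Avro source must use the same compression type.
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.sources.r1.compression-type = deflate
collector.sources.r1.ssl = true
collector.sources.r1.keystore = /etc/flume/keystore.jks
collector.sources.r1.keystore-password = changeit
```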

Flume Data Monitoring

The Manager UI visualizes metrics such as source input volume, channel buffer size, and sink output volume.

Flume Transmission Reliability

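Flume uses transactions to prevent data loss during transmission: an event is removed from a channel only after the sink has successfully delivered it. Persistent channels (e.g., the file channel) also retain their events across process or node restarts.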

If a downstream Flume node fails, Flume can automatically switch to an alternative path.
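This kind of path switching can be configured with a sink group that uses the failover sink processor. The fragment below is a sketch for a hypothetical agent `a1` with two Avro sinks, `k1` and `k2`, pointing at different downstream nodes; the host names, port, priorities, and penalty value are illustrative assumptions.

```properties
a1.sinks = k1 k2

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4545
a1.sinks.k2.channel = c1

# Group the sinks and let the failover processor pick the active one.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# The sink with the highest priority is used while it is healthy.
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper bound, in milliseconds, on the back-off applied to a failed sink.
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Setting `processor.type` to `load_balance` instead spreads events across the sinks in the group rather than keeping one in reserve.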

Flume Data Filtering

Flume can perform simple filtering and cleaning of events; for complex filtering, users can develop custom interceptor plugins.
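Simple filtering of this kind can often be done with built-in interceptors, without custom code. The fragment below is a sketch for a hypothetical agent `a1` with a source `r1` defined elsewhere; the regular expression is an assumption about what the log lines look like.

```properties
# Interceptors run in the order listed.
a1.sources.r1.interceptors = i1 i2

# i1: drop every event whose body matches the pattern.
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = .*DEBUG.*
a1.sources.r1.interceptors.i1.excludeEvents = true

# i2: add a timestamp header to the events that remain.
a1.sources.r1.interceptors.i2.type = timestamp
```

With `excludeEvents = false` (the default), the same interceptor keeps only the matching events instead of dropping them.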

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
