
Design and Integration of Flume, Kafka, Storm, Drools, and Redis for Real‑Time ETL Log Analysis

This article presents a modular architecture for real‑time ETL log analysis that combines Flume for log collection, Kafka as a buffering layer, Storm for stream processing, Drools for rule‑based data transformation, and Redis for fast storage, detailing installation, configuration, and code integration steps.

Java Architect Essentials

1. Introduction

Recording and real‑time analysis of ETL system logs helps monitor key metrics, detect defects, and identify performance bottlenecks. Storm is chosen as the primary framework for real‑time processing because it handles streaming data efficiently, and a modular design improves clarity and fault tolerance.

2. Framework Overview and Installation

2.1 Flume

Flume is a highly available, reliable, distributed log‑collection system. An agent consists of a source, channel, and sink; the source receives data, the channel buffers it, and the sink writes it to the destination. The following configuration creates a simple Avro source and a logger sink using an in‑memory channel:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source configuration
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# sink configuration
a1.sinks.k1.type = logger
# channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind source and sink to channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the agent with flume-ng agent -n a1 -f flume-conf.properties.

2.2 Kafka

Kafka is a distributed message queue originally developed by LinkedIn. It stores messages in topics, each divided into partitions and segments. The storage strategy ensures fast random access via offset indexing and reliable delivery through replication.
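To make the offset-indexing idea concrete, here is a toy sketch (not Kafka source code; class, method, and the sample offsets are illustrative): each segment file is named after the base offset of its first message, so locating the segment that holds a given offset reduces to a binary search over the sorted base offsets.

```java
import java.util.Arrays;

// Toy illustration of Kafka-style segment lookup: segment files are named by
// the base offset of their first message (e.g. 00000000000000170410.log), so
// finding the segment for an offset is a binary search over base offsets.
public class SegmentLookup {
    // Returns the base offset of the segment that contains `offset`.
    static long findSegment(long[] baseOffsets, long offset) {
        int idx = Arrays.binarySearch(baseOffsets, offset);
        if (idx >= 0) {
            return baseOffsets[idx];        // offset is exactly a segment boundary
        }
        int insertion = -idx - 1;           // index of first base offset greater than `offset`
        return baseOffsets[insertion - 1];  // the segment starting just before it
    }

    public static void main(String[] args) {
        long[] segments = {0L, 170410L, 239430L}; // hypothetical segment base offsets
        System.out.println(findSegment(segments, 170417L)); // prints 170410
    }
}
```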

To build a Kafka cluster, first set up a Zookeeper ensemble, then edit config/server.properties on each broker to set broker.id, zookeeper.connect, num.partitions, and host.name. Start each broker to complete the cluster.
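A minimal per-broker configuration might look like the following sketch (the IP, paths, and Zookeeper hosts are illustrative):

```properties
# config/server.properties (one copy per broker)

# Unique id of this broker within the cluster
broker.id=0
# Hostname advertised to clients (a Kafka 0.8-era setting)
host.name=10.200.187.71
# Default number of partitions per topic
num.partitions=4
# Directory holding the topic segment files
log.dirs=/data/kafka-logs
# Zookeeper ensemble backing the cluster
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
```

Each broker gets its own broker.id; everything else can usually be shared across the cluster.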

2.3 Storm

Storm is a distributed, fault‑tolerant real‑time computation system. It uses spouts (data sources) and bolts (processing units) assembled into a topology, analogous to Hadoop’s MapReduce job. Nimbus coordinates the cluster, while supervisors run workers that execute the topology.

Key concepts:

Stream – an unbounded sequence of tuples.

Spout – emits tuples into a stream.

Bolt – consumes tuples, processes them, and may emit new tuples.
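The three concepts can be illustrated with a toy pipeline in plain Java, with no Storm dependency; the class and method names here are invented for the sketch, and a real bolt would implement Storm's IBasicBolt instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

// Toy analogue of a Storm topology: a "spout" emits tuples into a stream
// (here a queue) and a "bolt" consumes, filters, and re-emits them.
public class MiniTopology {
    // Bolt: consume tuples from the stream, keep only error lines, emit them uppercased.
    static List<String> errorBolt(Queue<String> stream) {
        List<String> emitted = new ArrayList<>();
        String tuple;
        while ((tuple = stream.poll()) != null) {
            if (tuple.startsWith("ERROR")) {
                emitted.add(tuple.toUpperCase());
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        Queue<String> stream = new ArrayBlockingQueue<>(16);
        // Spout: emit raw log lines into the stream.
        stream.offer("INFO job started");
        stream.offer("ERROR disk full");
        System.out.println(errorBolt(stream)); // prints [ERROR DISK FULL]
    }
}
```

In real Storm the stream is unbounded and the framework drives the bolt's execute method per tuple; the queue here only stands in for that flow.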

2.4 Drools

Drools is a Java‑based open‑source rule engine that externalizes business logic into DRL files. By integrating Drools into Storm bolts, rule evaluation can be parallelized across workers.
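As a sketch of what such an externalized rule can look like (the LogEntry fact model and the rule itself are invented for illustration), a DRL file flagging slow ETL steps might read:

```
rule "flag slow ETL step"
    when
        $e : LogEntry( durationMs > 5000 )
    then
        $e.setLevel("WARN");
        update( $e );
end
```

Because the condition and action live in the DRL file rather than in Java code, thresholds like the 5000 ms cutoff can be changed without recompiling the topology.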

2.5 Redis

Redis is an in‑memory key‑value store supporting rich data structures. Its low latency makes it suitable for fast read/write of processed log data.
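For a feel of how processed log data might land in Redis, the commands below show one possible key layout (the key names and values are purely illustrative):

```
SET   etl:job42:status "RUNNING"
INCR  etl:job42:error_count
LPUSH etl:job42:recent_errors "ERROR disk full"
GET   etl:job42:status
```

Plain strings work for status flags and counters, while a list capped with LTRIM is a common choice for a rolling window of recent errors.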

3. Integration of the Components

3.1 Flume ↔ ETL System

Log4j2’s FlumeAppender sends log events directly to Flume’s Avro source. The following snippet shows the Flume configuration for the ETL system:

producer.sources = s
producer.channels = c
producer.sinks = r
producer.sources.s.type = avro
producer.sources.s.channels = c
producer.sources.s.bind = 10.200.187.71
producer.sources.s.port = 4141
producer.sinks.r.type = logger
producer.channels.c.type = memory
producer.channels.c.capacity = 1000
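On the application side, a matching Log4j2 configuration points the FlumeAppender at the same Avro host and port; this fragment is a sketch (the layout pattern is illustrative, and the log4j-flume-ng module must be on the classpath):

```xml
<Configuration>
  <Appenders>
    <!-- Ships log events to Flume's Avro source at 10.200.187.71:4141 -->
    <Flume name="FlumeAppender" compress="false">
      <Agent host="10.200.187.71" port="4141"/>
      <PatternLayout pattern="%d %-5p [%t] %c - %m%n"/>
    </Flume>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="FlumeAppender"/>
    </Root>
  </Loggers>
</Configuration>
```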

3.2 Flume ↔ Kafka

A custom KafkaSink class (shown below) reads events from Flume’s channel and forwards them to a Kafka topic.

public class KafkaSink extends AbstractSink implements Configurable {
    private static final Logger LOGGER = LoggerFactory.getLogger(KafkaSink.class);
    private Properties parameters;
    private Producer<String, String> producer;
    private Context context;
    @Override
    public void configure(Context context) { /* ... */ }
    @Override
    public synchronized void start() { /* ... */ }
    @Override
    public Status process() throws EventDeliveryException { /* ... */ }
    @Override
    public void stop() { producer.close(); }
}

3.3 Kafka ↔ Storm

The KafkaSpout (based on wurstmeister's storm-kafka-0.8-plus plugin) consumes messages from Kafka and emits them into a Storm topology. Core methods such as open, nextTuple, ack, and fail manage offsets via Zookeeper.

public class KafkaSpout extends BaseRichSpout { /* ... */ }

3.4 Storm Bolt ↔ Drools

A Storm bolt named LogRulesBolt loads a DRL file, creates a stateless Drools session, and applies rules to each log entry before emitting the result.

public class LogRulesBolt implements IBasicBolt { /* ... */ }

4. Reflections

4.1 Advantages

Modular design decouples components, improving fault tolerance.

Kafka buffers mismatched speeds between Flume and Storm.

Drools externalizes rules, enabling hot updates without code changes.

Integrating Drools into Storm provides parallel rule execution.

Redis offers fast in‑memory storage for processed data.

4.2 Open Issues

The system has not yet been stress-tested with large data volumes.

Current Flume‑to‑Kafka implementation targets a single broker/partition; multi‑broker support needs further work.

Topology restart is required after rule changes; hot‑loading is desirable.

Drools performance may become a bottleneck; alternatives like Esper could be explored.

Whether Flume is essential, or could be replaced by direct Kafka ingestion, remains to be evaluated.

Parameter tuning for Flume, Kafka, Storm, and Redis is critical for optimal performance.

4.3 Framework‑Level Thoughts

Flume’s Java source/sink architecture is extensible for custom integrations.

Kafka’s design of client‑side offset management and sequential disk I/O offers high throughput.

Storm’s mix of Clojure and Java provides a compact yet powerful streaming engine.

Redis’s C implementation yields minimal latency for key‑value operations.
