
Design and Integration of a Real-Time Log Analysis System Using Flume, Kafka, Storm, Drools, and Redis

This article details the design, installation, and modular integration of Flume, Kafka, Storm, Drools, and Redis to build a real‑time log analysis pipeline for ETL systems, discussing architecture, configuration, code examples, and practical considerations for scalability and fault tolerance.

Top Architect

Recording and analyzing ETL system logs in real time makes it possible to monitor performance metrics, detect defects, and identify bottlenecks as they occur, rather than after the fact.

1. Overview

Real‑time analysis is achieved using Apache Storm as the primary computation engine, with a modular design that separates data collection, buffering, processing, and storage to improve clarity and fault isolation.

2. Framework Introduction and Installation

2.1 Flume

Flume is a reliable, distributed log collection system. An agent consists of a source, channel, and sink, forming a pipeline similar to a water pipe where data flows from source to sink via a channel.

a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source configuration
# r1.type = avro (receives data via Avro protocol)
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# sink configuration (logger writes to console)
a1.sinks.k1.type = logger
# channel configuration (memory storage)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the agent with flume-ng agent -n a1 -f flume-conf.properties (the -n flag names the agent, -f points to the configuration file).

2.2 Kafka

Kafka is a distributed message queue used for log buffering. It stores messages in topics, partitions, and segments, providing high‑throughput sequential writes and reads.

Installation steps include setting up a Zookeeper cluster, downloading Kafka, modifying config/server.properties (broker.id, zookeeper.connect, num.partitions, host.name), and starting the brokers.
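As a minimal sketch, the broker settings referenced above might look like the following fragment; the port, partition count, and log directory here are illustrative values, not the original deployment's:

```properties
# config/server.properties (illustrative values)
broker.id=0                                  # must be unique per broker
host.name=10.200.187.71                      # address advertised to clients
port=9092
num.partitions=4                             # default partition count for new topics
log.dirs=/usr/endy/fks/kafka-logs            # where segment files are stored
zookeeper.connect=10.200.187.71:2181,10.200.187.73:2181
```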

2.3 Storm

Storm is a fault‑tolerant real‑time computation system analogous to Hadoop for batch processing. It uses spouts (data sources) and bolts (processing units) assembled into a topology.

Typical Storm cluster components: Nimbus (master) and Supervisors (workers). Example configuration snippets:

storm.zookeeper.servers:
  - "10.200.187.71"
  - "10.200.187.73"
storm.local.dir: "/usr/endy/fks/storm-workdir"
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

Supervisor configuration includes slot ports (e.g., 6700‑6702) and the Nimbus host.
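Sketched as a storm.yaml fragment, a supervisor's additions could look like this (the Nimbus host is an assumption based on the cluster addresses used above):

```yaml
nimbus.host: "10.200.187.71"
supervisor.slots.ports:        # one worker slot per port
  - 6700
  - 6701
  - 6702
```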

2.4 Drools

Drools is a Java‑based rule engine that externalizes complex processing logic into DRL files, allowing rule changes without code recompilation. It is integrated into Storm bolts to achieve distributed rule execution.
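A minimal sketch of what such a DRL file could contain; the LogEntry fact class, its fields, and the 5000 ms threshold are hypothetical examples, not rules from the original system:

```
package fks.rules

import fks.model.LogEntry;

rule "Flag slow ETL steps"
    when
        // hypothetical fact class and field
        $log : LogEntry( durationMs > 5000 )
    then
        $log.setTag( "slow-step" );
end
```

Because the rule lives in a text file on the classpath, the threshold can be changed and the file redeployed without recompiling the topology's Java code.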

2.5 Redis

Redis provides an in‑memory key‑value store with rich data structures, offering fast read/write performance for the processed log data.

3. Integration of the Components

3.1 Flume and ETL

Log4j2’s FlumeAppender sends logs directly to an Avro source defined in Flume. Example configuration:

producer.sources = s
producer.channels = c
producer.sinks = r
producer.sources.s.type = avro
producer.sources.s.channels = c
producer.sources.s.bind = 10.200.187.71
producer.sources.s.port = 4141
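On the application side, the matching Log4j2 appender might be declared as follows; this is a sketch assuming the standard log4j-flume-ng module, with the pattern layout chosen for illustration:

```xml
<Appenders>
  <!-- Sends each log event to the Flume Avro source declared above -->
  <Flume name="FlumeAppender" compress="false">
    <Agent host="10.200.187.71" port="4141"/>
    <PatternLayout pattern="%d [%t] %-5p %c - %m%n"/>
  </Flume>
</Appenders>
```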

3.2 Flume‑Kafka Integration

A custom KafkaSink class forwards Flume events to a Kafka topic. Key parts of the implementation include configuring the producer, handling transactions, and sending KeyedMessage objects.

public class KafkaSink extends AbstractSink implements Configurable {
    private static final Logger LOGGER = LoggerFactory.getLogger(KafkaSink.class);
    private Properties parameters;
    private Producer<String, String> producer;
    private Context context;

    @Override
    public void configure(Context context) { /* read broker list, topic, and serializer settings */ }

    @Override
    public synchronized void start() { /* build ProducerConfig and create the producer */ }

    @Override
    public Status process() throws EventDeliveryException { /* take an event from the channel inside a transaction and send it as a KeyedMessage */ }

    @Override
    public void stop() { producer.close(); }
}

The sink is declared in flume-conf.properties with Kafka connection details (broker list, serializer, acks, etc.).
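A sketch of that declaration follows; the sink's fully qualified class name and the topic property are assumptions, since they depend on how the custom KafkaSink is packaged and which keys its configure() method reads:

```properties
producer.sinks.r.type = com.example.flume.KafkaSink    # custom sink class (package name assumed)
producer.sinks.r.metadata.broker.list = 10.200.187.71:9092
producer.sinks.r.serializer.class = kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks = 1
producer.sinks.r.custom.topic.name = fks-log           # property name read by the custom sink (assumed)
producer.sinks.r.channel = c
```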

3.3 Kafka‑Storm Integration

The KafkaSpout from the storm‑kafka plugin consumes messages from Kafka and emits them into Storm topologies. The spout manages offsets via Zookeeper and supports both static and dynamic host discovery.

public class KafkaSpout extends BaseRichSpout { /* ... */ }
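Wiring the spout into a topology might look like the following sketch, using the storm-kafka 0.8-era API; the topic name, Zookeeper address, offset root, and parallelism hints are placeholders:

```java
// Build a KafkaSpout that reads the "fks-log" topic, tracking offsets in Zookeeper.
BrokerHosts hosts = new ZkHosts("10.200.187.71:2181");            // dynamic broker discovery via ZK
SpoutConfig spoutConfig = new SpoutConfig(hosts,
        "fks-log",                                                // topic (placeholder)
        "/fks-kafka-spout",                                       // ZK root for offset storage
        "log-analysis");                                          // consumer id
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme()); // emit raw messages as strings

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
builder.setBolt("rules-bolt", new LogRulesBolt(), 4).shuffleGrouping("kafka-spout");
```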

3.4 Storm‑Drools Integration

A Storm bolt (LogRulesBolt) loads a DRL file, creates a stateless knowledge session, and applies rules to each log entry.

public class LogRulesBolt implements IBasicBolt {
    private StatelessKnowledgeSession ksession;
    private String drlFile;
    @Override
    public void prepare(Map stormConf, TopologyContext context) { /* load DRL */ }
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) { /* apply rules */ }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields(LOG_ENTRY)); }
}
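The elided prepare() body can be sketched with the Drools 5 knowledge-builder API, which matches the StatelessKnowledgeSession type used above; loading the DRL from the classpath is an assumption:

```java
// Sketch of prepare(): compile the DRL file into a stateless session (Drools 5 API).
KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
kbuilder.add(ResourceFactory.newClassPathResource(drlFile), ResourceType.DRL);
if (kbuilder.hasErrors()) {
    // Fail fast: a broken rule file should stop the topology from starting.
    throw new RuntimeException("DRL compile errors: " + kbuilder.getErrors());
}
KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
kbase.addKnowledgePackages(kbuilder.getKnowledgePackages());
ksession = kbase.newStatelessKnowledgeSession();
```

In execute(), the bolt can then call ksession.execute(logEntry) to fire the rules against each entry before emitting it downstream.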

3.5 Storm‑Redis Integration

After rule processing, the bolt emits the enriched log entry to the next bolt, which writes the result into Redis for fast storage.
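A minimal sketch of such a writer bolt using the Jedis client; the host, field name, key scheme, and the LogEntry accessors (getDay, toJson) are all assumptions for illustration:

```java
// Sketch of a terminal Redis-writer bolt (Jedis client; key naming is an assumption).
public class RedisWriterBolt extends BaseBasicBolt {
    private transient Jedis jedis;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        jedis = new Jedis("10.200.187.71", 6379);   // one connection per bolt task
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        LogEntry entry = (LogEntry) input.getValueByField("LOG_ENTRY");
        // Append the enriched entry to a per-day list and bump a daily counter.
        jedis.rpush("logs:" + entry.getDay(), entry.toJson());
        jedis.incr("logs:count:" + entry.getDay());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* terminal bolt, no output */ }
}
```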

4. Reflections

4.1 Advantages

Modular design improves fault isolation and scalability.

Kafka buffers mismatched speeds between Flume and Storm.

Drools decouples business rules from code, enabling hot updates.

Storm‑Drools integration provides distributed rule execution.

Redis offers low‑latency storage for processed logs.

4.2 Open Issues

System needs extensive performance testing with large data volumes.

Current Flume‑Kafka sink targets a single broker/partition; multi‑broker support is pending.

Topology restart is required after rule changes; hot‑loading is not yet supported.

Drools performance may become a bottleneck; alternatives like Esper are under consideration.

Evaluating whether Flume is essential or if direct Kafka ingestion from ETL is sufficient.

Parameter tuning for Flume, Kafka, Storm, and Redis is critical for optimal performance.

4.3 Framework‑Level Thoughts

Flume’s Java‑based source and sink interfaces are extensible for custom integrations.

Kafka’s design leverages sequential disk I/O and client‑side offset management for high throughput.

Storm combines Clojure, Java, and Python, with a lightweight local cluster mode for testing.

Redis’s C implementation provides fast in‑memory data access.

Overall, the modular architecture enables flexible, real‑time log analysis while highlighting areas for future optimization and scalability improvements.

Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
