Design and Integration of a Real-Time Log Analysis System Using Flume, Kafka, Storm, Drools, and Redis
This article details the design, installation, and modular integration of Flume, Kafka, Storm, Drools, and Redis to build a real‑time log analysis pipeline for ETL systems, discussing architecture, configuration, code examples, and practical considerations for scalability and fault tolerance.
The article begins by explaining the importance of recording and analyzing ETL system logs in real time to monitor performance metrics, detect defects, and identify bottlenecks.
1. Overview
Real‑time analysis is achieved using Apache Storm as the primary computation engine, with a modular design that separates data collection, buffering, processing, and storage to improve clarity and fault isolation.
2. Framework Introduction and Installation
2.1 Flume
Flume is a reliable, distributed log collection system. An agent consists of a source, channel, and sink, forming a pipeline similar to a water pipe where data flows from source to sink via a channel.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source configuration
# r1.type = avro (receives data via Avro protocol)
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# sink configuration (logger writes to console)
a1.sinks.k1.type = logger
# channel configuration (memory storage)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent with: flume-ng agent -n a1 -f flume-conf.properties
2.2 Kafka
Kafka is a distributed message queue used for log buffering. It stores messages in topics, partitions, and segments, providing high‑throughput sequential writes and reads.
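The key-to-partition mapping that gives Kafka per-key ordering can be sketched in plain Java. This is an illustrative hash-mod scheme only, not Kafka's real partitioner (which uses murmur2 hashing and handles null keys):

```java
// Illustrative sketch of key-based partition assignment, similar in spirit
// to what a Kafka producer does. NOT Kafka's actual partitioner; it only
// demonstrates the key -> partition idea.
public class PartitionDemo {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the modulo result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Messages with the same key always land in the same partition,
        // so log lines from one ETL node keep their relative order.
        System.out.println(partitionFor("etl-node-1", 3));
        System.out.println(partitionFor("etl-node-2", 3));
    }
}
```

Because assignment depends only on the key, all log entries from a single ETL node stay ordered within one partition while the topic as a whole scales across brokers.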
Installation steps include setting up a Zookeeper cluster, downloading Kafka, modifying conf/server.properties (broker.id, zookeeper.connect, partitions, host.name), and starting the brokers.
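The broker settings mentioned above might look like the following in conf/server.properties; the values are illustrative, reusing the host addresses that appear later in this article:

```properties
# Unique id for this broker within the cluster
broker.id=0
# Advertised host of this broker (assumed)
host.name=10.200.187.71
# Zookeeper ensemble the brokers register with (assumed addresses)
zookeeper.connect=10.200.187.71:2181,10.200.187.73:2181
# Default number of partitions for newly created topics
num.partitions=3
```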
2.3 Storm
Storm is a fault‑tolerant real‑time computation system analogous to Hadoop for batch processing. It uses spouts (data sources) and bolts (processing units) assembled into a topology.
Typical Storm cluster components: Nimbus (master) and Supervisors (workers). Example configuration snippets:
storm.zookeeper.servers:
- "10.200.187.71"
- "10.200.187.73"
storm.local.dir: "/usr/endy/fks/storm-workdir"
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
Supervisor configuration includes slot ports (e.g., 6700-6702) and the Nimbus host.
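The corresponding supervisor entries in storm.yaml could look like this (Nimbus host taken from the cluster addresses above; each slot port hosts one worker JVM):

```yaml
nimbus.host: "10.200.187.71"
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
```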
2.4 Drools
Drools is a Java‑based rule engine that externalizes complex processing logic into DRL files, allowing rule changes without code recompilation. It is integrated into Storm bolts to achieve distributed rule execution.
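As a sketch of what such an externalized rule looks like, a DRL rule for flagging slow ETL steps might read as follows; the LogEntry fact class, its fields, and the 5000 ms threshold are assumptions for illustration, not taken from the article:

```
// slow-steps.drl -- illustrative rule; LogEntry is an assumed fact class
rule "flag slow ETL step"
    when
        $e : LogEntry( durationMs > 5000 )
    then
        $e.setFlag("SLOW");
end
```

Changing the threshold or adding a new rule only requires editing this file, not recompiling the topology code.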
2.5 Redis
Redis provides an in‑memory key‑value store with rich data structures, offering fast read/write performance for the processed log data.
3. Integration of the Components
3.1 Flume and ETL
Log4j2’s FlumeAppender sends logs directly to an Avro source defined in Flume. Example configuration:
producer.sources = s
producer.channels = c
producer.sinks = r
producer.sources.s.type = avro
producer.sources.s.channels = c
producer.sources.s.bind = 10.200.187.71
producer.sources.s.port = 4141
3.2 Flume‑Kafka Integration
A custom KafkaSink class forwards Flume events to a Kafka topic. Key parts of the implementation include configuring the producer, handling transactions, and sending KeyedMessage objects.
public class KafkaSink extends AbstractSink implements Configurable {
private static final Logger LOGGER = LoggerFactory.getLogger(KafkaSink.class);
private Properties parameters;
private Producer<String, String> producer;
private Context context;
@Override
public void configure(Context context) { /* ... */ }
@Override
public synchronized void start() { /* ... */ }
@Override
public Status process() throws EventDeliveryException { /* ... */ }
@Override
public void stop() { producer.close(); }
}
The sink is declared in flume-conf.properties with Kafka connection details (broker list, serializer, acks, etc.).
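A minimal sketch of that declaration, assuming the sink class is packaged as com.fks.flume.KafkaSink and reusing the broker host from earlier sections (class name, topic name, and the custom.topic.name property are assumptions; the broker/serializer/acks keys follow the Kafka 0.8-era producer configuration):

```properties
producer.sinks.r.type = com.fks.flume.KafkaSink
producer.sinks.r.metadata.broker.list = 10.200.187.71:9092
producer.sinks.r.serializer.class = kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks = 1
# assumed custom property read in configure() to pick the target topic
producer.sinks.r.custom.topic.name = fks-log
```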
3.3 Kafka‑Storm Integration
The KafkaSpout from the storm‑kafka plugin consumes messages from Kafka and emits them into Storm topologies. The spout manages offsets via Zookeeper and supports both static and dynamic host discovery.
public class KafkaSpout extends BaseRichSpout { /* ... */ }
3.4 Storm‑Drools Integration
A Storm bolt ( LogRulesBolt ) loads a DRL file, creates a stateless knowledge session, and applies rules to each log entry.
public class LogRulesBolt implements IBasicBolt {
private StatelessKnowledgeSession ksession;
private String drlFile;
@Override
public void prepare(Map stormConf, TopologyContext context) { /* load DRL */ }
@Override
public void execute(Tuple input, BasicOutputCollector collector) { /* apply rules */ }
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields(LOG_ENTRY)); }
}
3.5 Storm‑Redis Integration
After rule processing, the bolt emits the enriched log entry to the next bolt, which writes the result into Redis for fast storage.
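A minimal sketch of such a storage bolt using the Jedis client is shown below. The key scheme, the "LOG_ENTRY" field value, and the Redis host are assumptions for illustration; the class will not compile without storm-core and jedis on the classpath, and needs a running Redis instance:

```java
import java.util.Map;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import redis.clients.jedis.Jedis;

// Illustrative Redis-writer bolt (assumed key scheme "log:<timestamp>").
public class LogStorageBolt extends BaseBasicBolt {
    private transient Jedis jedis;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        // One connection per bolt instance; a JedisPool would be used in practice.
        jedis = new Jedis("10.200.187.71", 6379);
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // "LOG_ENTRY" is assumed to be the field name declared by LogRulesBolt.
        String entry = input.getStringByField("LOG_ENTRY");
        jedis.set("log:" + System.nanoTime(), entry); // assumed key scheme
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing is emitted downstream.
    }

    @Override
    public void cleanup() {
        jedis.close();
    }
}
```

In a production topology the single connection would be replaced by a JedisPool, and the key scheme would encode whatever lookup pattern the dashboards need (per node, per metric, per time bucket).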
4. Reflections
4.1 Advantages
Modular design improves fault isolation and scalability.
Kafka buffers mismatched speeds between Flume and Storm.
Drools decouples business rules from code, enabling hot updates.
Storm‑Drools integration provides distributed rule execution.
Redis offers low‑latency storage for processed logs.
4.2 Open Issues
System needs extensive performance testing with large data volumes.
Current Flume‑Kafka sink targets a single broker/partition; multi‑broker support is pending.
Topology restart is required after rule changes; hot‑loading is not yet supported.
Drools performance may become a bottleneck; alternatives like Esper are under consideration.
Evaluating whether Flume is essential or if direct Kafka ingestion from ETL is sufficient.
Parameter tuning for Flume, Kafka, Storm, and Redis is critical for optimal performance.
4.3 Framework‑Level Thoughts
Flume’s Java‑based source and sink interfaces are extensible for custom integrations.
Kafka’s design leverages sequential disk I/O and client‑side offset management for high throughput.
Storm combines Clojure, Java, and Python, with a lightweight local cluster mode for testing.
Redis’s C implementation provides fast in‑memory data access.
Overall, the modular architecture enables flexible, real‑time log analysis while highlighting areas for future optimization and scalability improvements.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.