Understanding Stream Processing, Event Sourcing, and Complex Event Processing
The article explains the fundamentals of stream processing, event sourcing, and complex event processing, comparing raw event storage with aggregated results, illustrating architectures with Kafka, Samza, and other frameworks, and highlighting benefits such as scalability, flexibility, and decoupling for modern data‑driven systems.
Different terms such as stream processing, event sourcing, CQRS, and Complex Event Processing (CEP) describe similar concepts of handling immutable events; this article follows Martin Kleppmann’s perspective to deepen understanding for better system design.
Stream processing originated from LinkedIn’s large‑scale data systems and is realized in open‑source projects Apache Kafka and Apache Samza; the article uses Google Analytics page‑view events as an illustrative example.
Each page‑view event is a simple immutable fact that records what happened, and these events can be stored in various ways.
Option (a) stores every incoming event in a large database, data warehouse, or Hadoop cluster, allowing queries that scan all events or large subsets for dynamic aggregation.
Option (b) stores aggregated results, such as incrementing a counter for each event or keeping multiple counters in an OLAP cube, enabling fast reads of pre‑computed values without scanning full event logs.
Storing raw events (option a) maximizes analytical flexibility, useful for offline tasks like training recommendation systems, whereas aggregated storage (option b) is advantageous for real‑time decisions such as rate‑limiting, where maintaining per‑IP counters is more efficient.
For simple aggregated updates, counters can be kept in caches like memcached or Redis with atomic increments; more complex setups may introduce an event stream, message queue, or event log, where the stream events share the same structure as the original PageViewEvent.
This architecture allows multiple consumers to process the same event data for different tasks, providing flexibility and easy scalability.
Event sourcing, a concept from domain‑driven design, focuses on how data is stored in databases; using an e‑commerce shopping‑cart example, the article shows that instead of destructive UPDATE statements, each change should be recorded as an immutable event (e.g., AddToCart, UpdateCartQuantity).
Both stream processing and event sourcing share two storage strategies: (a) raw immutable events for ideal write performance (append‑only) and (b) aggregated results for ideal read performance.
The article notes that many databases already implement an immutable write‑ahead log, which is essentially an event stream, with implementations varying across PostgreSQL, InnoDB, Oracle MVCC, CouchDB, Datomic, and LMDB.
At the application level, Martin demonstrates using Apache Kafka as a durable publish‑subscribe message broker and Apache Samza as the processing engine; other popular stream‑processing frameworks such as Storm and Spark Streaming are also mentioned.
Higher‑level stream languages like CEP enable writing queries or rules that continuously match patterns in the event stream, useful for fraud detection or business‑process monitoring.
Related concepts include full‑text search on streams, actor frameworks (Akka, Orleans, Erlang OTP) that are built on immutable event streams, reactive programming models, and change‑data‑capture (CDC) that extracts insert, update, and delete operations into an event stream.
Loose coupling – separate read and write models reduce inter‑component dependencies.
Read/write performance – append‑only writes are fast, while pre‑aggregated reads are fast.
Scalability – the simple abstraction of event streams enables parallel processing across machines.
Flexibility – immutable raw events simplify schema migrations; transformed caches can be rebuilt for new UI needs.
Error handling – immutable events can be replayed to recover from failures.
Art of Distributed System Architecture Design
Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.