Understanding Logs: The Core of Distributed Systems and Data Integration
This article explains how logs—simple, append‑only, time‑ordered records—serve as the fundamental abstraction behind databases, distributed systems, data integration pipelines, and platforms like Kafka and Hadoop, illustrating their role in ordering, replication, scalability, and real‑time analytics.
I joined LinkedIn six years ago and helped move from a single, centralized database to a suite of distributed systems, including a distributed graph database, search backend, Hadoop installations, and first‑ and second‑generation key‑value stores.
From that experience I learned that the core idea behind many of these systems is the log—sometimes called a write‑ahead log, commit log, or transaction log—which has existed since the earliest computers and underpins most distributed data systems and real‑time applications.
If you don’t understand logs, you can’t fully grasp databases, NoSQL stores, replication, Paxos, Hadoop, version control, or most software systems; yet most engineers are unfamiliar with them. This post aims to change that by covering what logs are and how they are used in data integration, real‑time processing, and system construction.
Part 1: What is a log?
A log is a simple storage abstraction: an append‑only, time‑ordered sequence of records, each with a unique, monotonically increasing identifier that acts like a timestamp.
Logs differ from files (a byte stream) and tables (a set of records) in that they are strictly ordered by time, which becomes crucial when running multiple distributed systems.
The content and format of individual log entries matter little for this discussion; storage is of course finite, so records cannot be appended forever, but we can set that concern aside here.
Logs are essentially ordered tables or files, and they record “what happened when,” which is the key to many distributed data‑system challenges.
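The abstraction above can be made concrete with a minimal in-memory sketch, assuming nothing beyond the definition: an append-only sequence where each record receives a monotonically increasing offset that doubles as a logical timestamp. The class name `MiniLog` is illustrative, not from any real library.

```python
class MiniLog:
    """A toy append-only, time-ordered log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The offset is the record's position in time order;
        # it only ever increases, acting like a timestamp.
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        # Consumers read forward from a remembered position.
        return self._records[offset:]

log = MiniLog()
log.append({"user": "alice", "action": "login"})
off = log.append({"user": "bob", "action": "view"})
log.append({"user": "alice", "action": "logout"})

# A reader that remembers its offset resumes exactly where it left off.
later = log.read_from(off)
```

Note that unlike a file, reads and writes touch only the ends of the structure, and unlike a table, history is never overwritten.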
It is important to distinguish “application logs” (human‑readable error or trace messages) from the “data log” discussed here, which is designed for programmatic access.
Database logs
Logs have existed since IBM’s System R and are used to ensure atomicity and durability by writing intended changes to a log before applying them to the database.
Over time, logs also became the mechanism for database replication, with Oracle, MySQL, and PostgreSQL providing log‑shipping protocols for standby replicas.
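The write-ahead discipline described above can be sketched in a few lines, assuming a simple key-value "table" and a JSON-lines log file (both the file name and record format are illustrative): the intended change is flushed to the log before the table is mutated, so a crash between the two steps is repaired by replay.

```python
import json
import os

LOG_PATH = "wal.log"

# Start fresh for this demo so replay sees only our entries.
if os.path.exists(LOG_PATH):
    os.remove(LOG_PATH)

def apply_change(table, key, value):
    # 1. Durably record the intent first.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())
    # 2. Only then apply it to the primary structure.
    table[key] = value

def recover():
    # Rebuild the table by replaying the log from the beginning.
    table = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            for line in f:
                entry = json.loads(line)
                table[entry["key"]] = entry["value"]
    return table

db = {}
apply_change(db, "alice", "active")
apply_change(db, "bob", "idle")
recovered = recover()
```

Because the log is written first, `recover()` reconstructs the table even if the process dies before step 2 completes, which is precisely the atomicity and durability guarantee the log provides.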
Distributed system logs
Logs solve two problems in distributed systems: ordering of changes and data dissemination. They enable state‑machine replication, where identical deterministic processes starting from the same state and receiving the same ordered inputs produce identical outputs.
Determinism means the processing is independent of time and external nondeterministic inputs; logs provide the ordered input stream that removes nondeterminism.
Because each log entry carries a monotonically increasing number, the log acts as a logical clock: a replica's state can be summarized by the highest entry it has applied, which makes it straightforward to keep replicas consistent, and to resynchronize them, even when some fail.
Physical logs record every changed row; logical logs record the SQL statements that caused the changes.
Two common replication models use the log differently: in the state‑machine (active‑active) model, the log carries incoming requests and every replica processes them in order; in the primary‑backup (active‑passive) model, a leader processes requests and logs the resulting state changes for followers to replay.
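The state-machine idea can be demonstrated in miniature, under the stated assumption that processing is deterministic: two replicas starting from the same empty state and consuming the same ordered log of commands must end in identical states. The command vocabulary (`set`, `incr`) is made up for illustration.

```python
def apply_command(state, cmd):
    # Deterministic: the result depends only on state and command,
    # never on wall-clock time or other external input.
    op, key, amount = cmd
    if op == "set":
        state[key] = amount
    elif op == "incr":
        state[key] = state.get(key, 0) + amount
    return state

log = [("set", "x", 10), ("incr", "x", 5), ("set", "y", 1)]

replica_a, replica_b = {}, {}
for cmd in log:                    # same ordered input stream...
    apply_command(replica_a, cmd)
    apply_command(replica_b, cmd)
# ...yields the same output state on every replica.
```

If the commands were delivered in different orders to the two replicas, `incr` before `set` would produce divergent states; the log's total order is exactly what rules that out.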
Change‑log 101: Table‑event duality
Change logs can be used to reconstruct tables at any point in time, providing both the current state (the table) and the history of changes (the log).
This duality mirrors version‑control systems, where patches (logs) are applied to snapshots (tables).
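The duality above implies that replaying any prefix of the change log reconstructs the table as of that point in time, much like checking out an old commit. A minimal sketch, with hypothetical data:

```python
# An ordered changelog of (key, value) updates.
changes = [
    ("alice", "engineer"),
    ("bob", "designer"),
    ("alice", "manager"),   # alice's record is updated later
]

def table_as_of(log, version):
    # Replay the first `version` changes to get that point-in-time table.
    table = {}
    for key, value in log[:version]:
        table[key] = value
    return table

v1 = table_as_of(changes, 1)   # {"alice": "engineer"}
v3 = table_as_of(changes, 3)   # {"alice": "manager", "bob": "designer"}
```

The table is thus a cache of the latest value per key in the log, while the log retains the full history, the same relationship a working copy has to its version-control history.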
Data integration
Data integration means making all of an organization's data available to every service and system that needs it; ETL is only a subset of this broader problem.
Reliable, complete data flows are essential for Hadoop, real‑time query systems, and downstream analytics.
Two trends make integration harder: the growth of event data (user activity, machine metrics) and the explosion of specialized data systems (search, OLAP, document stores, etc.).
Using a central log as the integration point decouples producers from consumers, allowing many downstream systems (caches, Hadoop, search, analytics) to subscribe without tight coupling.
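The decoupling above can be sketched as a single shared log with multiple independent subscribers, each tracking its own read offset; producers never need to know who is consuming or how fast. The class names and event strings are illustrative only.

```python
class SharedLog:
    """Central append-only log; producers only ever append."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class Subscriber:
    """Each consumer advances through the log at its own pace."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        new = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return new

log = SharedLog()
search, hadoop = Subscriber(log), Subscriber(log)

log.append("profile-updated")
indexed = search.poll()    # a fast consumer sees events immediately
log.append("job-viewed")
batched = hadoop.poll()    # a slow consumer catches up and sees both
```

Adding a new downstream system is just attaching another subscriber at offset zero; no producer changes are required, which is the essence of the decoupling.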
Log‑centric architecture at LinkedIn
LinkedIn migrated from centralized relational databases to many specialized distributed systems (search, social graph, Voldemort, Espresso, recommendation engine, OLAP, Hadoop, Teradata, Ingraphs, etc.).
The Databus service provided a log‑caching abstraction over Oracle tables to feed the social‑graph and search indexes.
Later, Kafka became the central, multi‑subscriber event log, handling hundreds of billions of messages per day.
Kafka achieves scalability through log partitioning, batching of reads and writes, and avoiding unnecessary data copying.
Each partition is an ordered log; partitions are replicated for fault tolerance, and any replica can become the leader.
Although there is no global order across partitions, ordering within a partition is guaranteed, which is sufficient for most use cases.
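The per-partition ordering guarantee works because records are routed by key: a hash of the key picks the partition, so all records for a given key land in the same partition in the order they were produced. A sketch of that routing, with a made-up partition count:

```python
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Stable hash of the key selects the partition, so the same key
    # always maps to the same partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

def produce(key, value):
    p = partition_for(key)
    partitions[p].append((key, value))
    return p

# All of alice's events land in one partition, in production order.
p1 = produce("alice", "login")
p2 = produce("alice", "view")
p3 = produce("alice", "logout")
```

There is no ordering across partitions, but as long as all events that must be ordered relative to each other (here, one user's activity) share a key, per-partition order is enough.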
Logs act as buffers, allowing producers to run faster than consumers without data loss.
Stream processing (Storm, Samza, etc.) consumes these logs; logs provide the durable, ordered backbone for both batch and real‑time pipelines.
Stateful stream processing stores its state in local tables or indexes, and the state changes are also captured in change logs, enabling recovery after failures.
Log compaction in Kafka enables retention policies that keep only the latest record per key for keyed data, while time‑based retention preserves a full window of history for event data.
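The key-based retention idea can be sketched as follows: for a keyed changelog, only the most recent record per key is needed to rebuild current state, so earlier records for the same key can be discarded. This is a conceptual illustration, not Kafka's actual compaction implementation.

```python
def compact(log):
    # Keep only the last record for each key, preserving log order.
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    # Re-emit the surviving records in their original order.
    survivors = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (offset, value) in survivors]

log = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
compacted = compact(log)   # [("a", 3), ("c", 4), ("b", 5)]
```

Replaying the compacted log yields the same final table as replaying the full log, which is why compaction is safe for state-oriented (keyed) data but not for event streams where every record matters.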
Overall, logs unify data movement, replication, and processing across databases, caches, search, analytics, and stream‑processing frameworks.
Future directions
Three possible trends: continued isolation of systems, consolidation into a single “super‑system,” or a modular, open‑source stack that can be assembled as needed.
Key supporting technologies include Zookeeper (coordination), Mesos/YARN (resource management), Lucene/LevelDB (indexing), Netty/Finagle (networking), Avro/Protobuf/Thrift (serialization), and Kafka/BookKeeper (logs).
Understanding logs is essential for designing reliable, scalable, and maintainable data architectures.