Why Logs Are the Hidden Backbone of Distributed Systems and Real‑Time Data
This note distills Jay Kreps' long blog post on logs, explaining their core role in distributed databases, real‑time data pipelines, replication, and state‑machine consistency, and showing how logs unify concepts from version control to streaming architectures.
Preface
This is a study note based on Jay Kreps' long blog post about logs.
The original article is long, but I finished reading it and was impressed by Jay's technical and architectural expertise, as well as his deep understanding of distributed systems.
Jay Kreps was a Principal Staff Engineer at LinkedIn and is now co‑founder and CEO of Confluent. He is one of the primary authors of Kafka and Samza.
Source
The Log: What every software engineer should know about real‑time data's unifying abstraction
Notes
2.1 Value of Logs
1) Logs are at the core of several systems:
Distributed graph databases
Distributed search engines
Hadoop
First‑ and second‑generation key‑value stores
2) Logs may be as old as computing itself and are central to distributed data and real‑time computation systems.
3) Logs are known by many names:
Commit log
Transaction log
Write‑ahead log
4) Without understanding logs, you cannot fully grasp:
Databases
NoSQL storage
Key‑value stores
Replication
Paxos algorithm
Hadoop
Version control
Almost any software system
2.2 What Is a Log?
2.2.1 Overview
Records are appended to the tail of a log.
Records are read left‑to‑right.
Each entry has a unique, ordered sequence number.
The order of records defines a notion of time: earlier records are to the left. The entry number can serve as a timestamp, decoupling logical time from any physical clock.
A log is not very different from a file or a table: a file is an array of bytes, a table is an array of records, and a log is a table or file whose records are sorted by time.
Logs record what happened and when.
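The properties above can be sketched as a small in-memory structure. This is only an illustration, not code from the original article: the class name AppendOnlyLog and its methods are hypothetical, and a real log would persist records to disk.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class AppendOnlyLog:
    """In-memory append-only log; the list index doubles as the sequence number."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Add a record at the tail and return its sequence number (offset)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, start: int = 0) -> Iterator[Tuple[int, Any]]:
        """Yield (offset, record) pairs oldest first, beginning at `start`."""
        for offset in range(start, len(self._records)):
            yield offset, self._records[offset]


log = AppendOnlyLog()
first = log.append("user 42 created")
second = log.append("user 42 renamed")
entries = list(log.read())
```

The offset returned by append plays the role of the logical timestamp: it orders records without consulting any physical clock.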
Important distinctions:
The log discussed here differs from typical application logs, which are unstructured, human‑readable logs for debugging.
The logs in this note are programmatically accessed, such as journals or data logs.
Application logs can be seen as a degenerate special case of the logs described here.
2.2.2 Logs in Databases
The log concept is old: it appears at least as early as IBM's System R. Databases use logs to maintain consistency and durability: before modifying data structures or indexes, the intended changes are recorded.
Because logs are persisted immediately, they provide a reliable source for recovery after a crash.
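A minimal sketch of this write-ahead idea, assuming a toy key-value store that logs each change as one JSON line. The TinyKV class and the file layout are hypothetical, and the sketch ignores torn writes, checkpoints, and log truncation, which real databases must handle.

```python
import json
import os
import tempfile


class TinyKV:
    """Toy key-value store: every change is logged to disk before it is applied."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.data = {}
        self._recover()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _recover(self) -> None:
        """Rebuild in-memory state by replaying the log from the start."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key: str, value) -> None:
        # 1. Record the intended change and force it to stable storage.
        self._wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 2. Only then modify the in-memory structure.
        self.data[key] = value

    def close(self) -> None:
        self._wal.close()


wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyKV(wal_file)
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)
store.close()  # stand-in for a crash: the in-memory dict is discarded

recovered = TinyKV(wal_file)  # a new instance replays the log on startup
recovered_data = dict(recovered.data)
recovered.close()
```

Because the log entry is made durable before the in-memory update, anything the store acknowledged can be rebuilt by replaying the file.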
Logs evolved from an ACID‑ensuring mechanism to a means of data replication between databases.
Major databases (Oracle, MySQL, PostgreSQL) include log‑shipping protocols that transmit portions of the log to replica databases. Oracle’s XStream and GoldenGate expose the log as a general data‑subscription mechanism, and similar facilities exist in MySQL and PostgreSQL.
Machine‑oriented logs are also used in:
Message systems
Data flows
Real‑time computation
2.2.3 Logs in Distributed Systems
Logs solve two key problems in distributed data systems:
Ordered data changes
Distribution of data across nodes
State Machine Replication Principle: if two deterministic processes start from the same state and receive the same inputs in the same order, they will produce the same outputs and end in the same state.
Deterministic means the processing does not depend on timing and is not influenced by any other out‑of‑band input. Sources of non‑determinism include the order in which threads happen to execute and calls to time‑dependent functions.
In distributed systems, feeding the same log to multiple deterministic replicas ensures they stay synchronized.
The purpose of the log is to squeeze the nondeterminism out of the input stream, so that every replica processing that input stays in sync.
Each replica can be described by a single number: the timestamp (sequence number) of the latest log entry it has processed. That timestamp, together with the log, uniquely captures the replica’s entire state.
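As a rough sketch of this principle, the replicas below consume one shared log in order, and each tracks the offset of the last entry it applied. The command format and the Replica class are assumptions made for illustration.

```python
from typing import List, Tuple

Command = Tuple[str, str, int]  # (operation, key, amount): hypothetical format


class Replica:
    """Deterministic state machine that consumes a shared log in order."""

    def __init__(self) -> None:
        self.state = {}
        self.applied = -1  # offset of the highest log entry processed so far

    def catch_up(self, log: List[Command], upto: int) -> None:
        """Apply entries applied+1 .. upto, strictly in log order."""
        for offset in range(self.applied + 1, upto + 1):
            op, key, amount = log[offset]
            current = self.state.get(key, 0)
            if op == "add":
                self.state[key] = current + amount
            elif op == "mul":
                self.state[key] = current * amount
            else:
                raise ValueError(f"unknown operation: {op}")
            self.applied = offset


shared_log: List[Command] = [
    ("add", "x", 1),
    ("mul", "x", 2),
    ("add", "y", 5),
    ("add", "x", 3),
]

fast, slow = Replica(), Replica()
fast.catch_up(shared_log, upto=3)
slow.catch_up(shared_log, upto=1)   # a lagging replica, fully described by offset 1
lagging_state = dict(slow.state)
slow.catch_up(shared_log, upto=3)   # same log, same order, so same final state
```

The lagging replica is not wrong, only behind: its applied offset says exactly which prefix of the log its state reflects.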
Typical usage patterns include:
Recording service requests in the log
Logging the state changes the service undergoes in response to requests
Recording a sequence of transformation commands
Logical logs record SQL statements (INSERT, UPDATE, DELETE) that cause changes, while physical logs record the actual row modifications.
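The difference can be illustrated with Python's built-in sqlite3 module. The accounts table, the statements, and the row-image format below are made up for this sketch and do not reflect how any particular database encodes its log.

```python
import sqlite3

# Logical log: the statements that caused the changes.
logical_log = [
    "INSERT INTO accounts (id, balance) VALUES (1, 100)",
    "INSERT INTO accounts (id, balance) VALUES (2, 50)",
    "UPDATE accounts SET balance = balance + 10",
    "DELETE FROM accounts WHERE id = 2",
]

# Physical log: the resulting contents of each changed row (None marks a deletion).
physical_log = [
    (1, {"balance": 100}),
    (2, {"balance": 50}),
    (1, {"balance": 110}),
    (2, {"balance": 60}),
    (2, None),
]


def replay_logical(statements):
    """Re-execute each SQL statement against an empty in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    for stmt in statements:
        conn.execute(stmt)
    rows = dict(conn.execute("SELECT id, balance FROM accounts").fetchall())
    conn.close()
    return rows


def replay_physical(entries):
    """Overwrite or drop rows using the logged row images; no SQL is evaluated."""
    table = {}
    for row_id, image in entries:
        if image is None:
            table.pop(row_id, None)
        else:
            table[row_id] = image["balance"]
    return table


from_logical = replay_logical(logical_log)
from_physical = replay_physical(physical_log)
```

Note that the single UPDATE statement in the logical log corresponds to two entries in the physical log, one per modified row.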
Two common replication models:
State‑machine model (active‑active)
Primary‑backup model (active‑passive)
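In the active‑active model, the log holds the incoming operations, such as “+1” or “*2”, and every replica applies each of them to keep its state consistent. In the active‑passive model, one elected leader executes the operations in arrival order and logs the resulting state changes, which the other replicas then apply.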
Maintaining operation order is crucial; reordering leads to divergent results.
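A small sketch of both models using the “+1” and “*2” operations, assuming a single integer as the replicated state. The function names are hypothetical.

```python
from typing import Callable, Dict, List

OPS: Dict[str, Callable[[int], int]] = {
    "+1": lambda v: v + 1,
    "*2": lambda v: v * 2,
}


def apply_ops(ops: List[str], start: int = 0) -> int:
    """State-machine style: every replica executes the logged operations itself."""
    value = start
    for op in ops:
        value = OPS[op](value)
    return value


def primary_results(ops: List[str], start: int = 0) -> List[int]:
    """Primary-backup style: the primary executes and logs each resulting value."""
    results, value = [], start
    for op in ops:
        value = OPS[op](value)
        results.append(value)
    return results


request_log = ["+1", "*2", "+1"]

replica_a = apply_ops(request_log)
replica_b = apply_ops(request_log)
reordered = apply_ops(["*2", "+1", "+1"])  # same operations, different order

result_log = primary_results(request_log)
backup = result_log[-1]  # a backup adopts the logged results without recomputing
```

With the same log, replica_a and replica_b agree, while the reordered run ends at a different value even though it contains exactly the same operations.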
Distributed logs can serve as the data structure for consistency algorithms such as Paxos, ZAB, Raft, and Viewstamped Replication.
2.2.4 Changelog
From a database perspective, a changelog of record modifications is dual to a table: logs can reconstruct a table’s state, and table changes can be recorded into a log.
This is the secret to near‑real‑time replication.
The concept mirrors version control: patches (logs) are applied to a branch snapshot (table) to reflect state changes.
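The duality can be sketched in both directions: replaying a changelog yields a table, and a table that records its own mutations yields a changelog. The names materialize and LoggedTable are hypothetical, and the change format (key, new value) is an assumption for illustration.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None means the key was deleted


def materialize(changelog: List[Change]) -> Dict[str, int]:
    """Log to table: replay changes in order to rebuild the current state."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


class LoggedTable:
    """Table to log: every mutation is also appended to a changelog."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.changelog: List[Change] = []

    def put(self, key: str, value: int) -> None:
        self.rows[key] = value
        self.changelog.append((key, value))

    def delete(self, key: str) -> None:
        self.rows.pop(key, None)
        self.changelog.append((key, None))


source = LoggedTable()
source.put("alice", 1)
source.put("bob", 2)
source.put("alice", 3)
source.delete("bob")

replica = materialize(source.changelog)       # full replay matches the source table
snapshot = materialize(source.changelog[:2])  # state as of an earlier point in the log
```

Replaying a prefix of the changelog reproduces an earlier version of the table, which is the same relationship that patches have to a snapshot in version control.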
21CTO