Why Logs Are the Hidden Backbone of Distributed Systems and Real‑Time Data

This note distills Jay Kreps' extensive blog on logs, explaining their core role in distributed databases, real‑time data pipelines, replication, and state‑machine consistency, and showing how logs unify concepts from version control to streaming architectures.

21CTO

Jul 12, 2017

Preface

This is a study note based on Jay Kreps' extensive blog about logs.

The original article is long, but I finished reading it and was impressed by Jay's technical and architectural expertise, as well as his deep understanding of distributed systems.

Jay Kreps was a Principal Staff Engineer at LinkedIn and is now co‑founder and CEO of Confluent. He is a primary author of Kafka and Samza.

Source

The Log: What every software engineer should know about real‑time data's unifying abstraction

Notes

2.1 Value of Logs

1) Logs are at the core of several systems:

Distributed graph databases

Distributed search engines

Hadoop

First‑ and second‑generation key‑value stores

2) Logs may be as old as computing itself and are central to distributed data and real‑time computation systems.

3) Logs are known by many names:

Commit log

Transaction log

Write‑ahead log

4) Without understanding logs, you cannot fully grasp:

Databases

NoSQL storage

Key‑value stores

Replication

Paxos algorithm

Hadoop

Version control

Almost any software system

2.2 What Is a Log?

2.2.1 Overview

Records are appended to the tail of a log.

Records are read left‑to‑right.

Each entry has a unique, ordered sequence number.

The order of records defines a notion of time: earlier records are to the left. The entry number can serve as a timestamp, decoupling logical time from any physical clock.

A log is not so different from a file or a table: a file is an array of bytes, a table is an array of records, and a log is a table or file whose records are sorted by time.

Logs record what happened and when.

Important distinctions:

The log discussed here differs from typical application logs, which are unstructured, human‑readable logs for debugging.

The logs in this note are programmatically accessed, such as journals or data logs.

Application logs are a specialization of the logs described here.
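
The properties listed in the overview (append at the tail, read in order, one unique sequence number per entry) can be sketched in a few lines of Python. This is only an illustrative in-memory model; the class name `AppendOnlyLog` and its methods are assumptions made for this note, not an API from the original article or from Kafka.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class AppendOnlyLog:
    """Hypothetical in-memory log: records are only ever added at the tail."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its sequence number (its offset)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int = 0) -> Iterator[Tuple[int, Any]]:
        """Yield (sequence number, record) pairs in order, oldest first."""
        for seq in range(offset, len(self._records)):
            yield seq, self._records[seq]


log = AppendOnlyLog()
first = log.append({"op": "set", "key": "x", "value": 1})
second = log.append({"op": "set", "key": "x", "value": 2})
entries = list(log.read_from(0))
```

The sequence number returned by `append` plays the role of the logical timestamp: a reader that remembers the last number it saw can resume from exactly that point without consulting any physical clock.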

2.2.2 Logs in Databases

The log concept is present at least as early as IBM's System R. Databases use logs to maintain consistency and durability: before modifying data structures or indexes, the database first writes the intended changes to the log.

Because logs are persisted immediately, they provide a reliable source for recovery after a crash.
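
The write-before-modify rule and crash recovery can be sketched as follows. This is a minimal toy, assuming a JSON-lines log file and a hypothetical `TinyKV` class; real databases add checksums, checkpoints, and handling for partially written entries, none of which is shown here.

```python
import json
import os
import tempfile


class TinyKV:
    """Hypothetical key-value store that logs each change before applying it."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.data = {}
        self._recover()

    def _recover(self) -> None:
        """Rebuild in-memory state by replaying the log from the start."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r", encoding="utf-8") as wal:
            for line in wal:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key: str, value) -> None:
        # 1. Record the intended change and force it to disk first.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then modify the in-memory data structure.
        self.data[key] = value


wal_path = os.path.join(tempfile.mkdtemp(), "store.wal")
store = TinyKV(wal_path)
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)

# Simulate a crash: discard the in-memory state and reopen from the log alone.
recovered = TinyKV(wal_path)
```

Because every change reaches the log before it reaches `data`, the log alone is enough to rebuild the state after a restart.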

Logs evolved from an ACID‑ensuring mechanism to a means of data replication between databases.

Major databases (Oracle, MySQL, PostgreSQL) include log‑transfer protocols to replicate changes to replica databases. Oracle’s XStream and GoldenGate treat the log as a general data‑subscription mechanism, and similar components exist in MySQL and PostgreSQL.

Machine‑oriented logs are also used in:

Message systems

Data flows

Real‑time computation

2.2.3 Logs in Distributed Systems

Logs solve two key problems in distributed data systems:

Ordered data changes

Distribution of data across nodes

State Machine Replication Principle: if two deterministic processes start from the same state and receive the same inputs in the same order, they will produce the same outputs and end in the same state.

Deterministic means the process’s result does not depend on time or external inputs. Non‑deterministic examples include varying thread execution orders or calls to time‑dependent functions.

In distributed systems, feeding the same log to multiple deterministic replicas ensures they stay synchronized.

The log's job is to squeeze the nondeterminism out of the input stream, so that every replica processing those inputs stays consistent.

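The log's sequence numbers also act as a clock for the replicas: each replica can be described by a single number, the timestamp of the latest log entry it has processed. That timestamp, together with the log, uniquely captures the replica’s entire state.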
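
A small sketch of the state machine replication principle, assuming a hypothetical command format of `(operation, key, amount)` and a hypothetical `Replica` class. Each replica is fully described by how many log entries it has applied.

```python
from typing import List, Tuple

Command = Tuple[str, str, int]  # (operation, key, amount): a hypothetical format


class Replica:
    """Deterministic state machine: state depends only on the log prefix applied."""

    def __init__(self) -> None:
        self.state = {}
        self.applied = 0  # number of log entries processed so far

    def catch_up(self, log: List[Command], upto: int) -> None:
        """Apply log entries in order until `upto` entries have been processed."""
        while self.applied < upto:
            op, key, amount = log[self.applied]
            current = self.state.get(key, 0)
            if op == "add":
                self.state[key] = current + amount
            elif op == "mul":
                self.state[key] = current * amount
            else:
                raise ValueError(f"unknown operation: {op}")
            self.applied += 1


shared_log: List[Command] = [("add", "x", 1), ("mul", "x", 2), ("add", "y", 5)]

a, b = Replica(), Replica()
a.catch_up(shared_log, upto=3)
b.catch_up(shared_log, upto=2)   # b lags one entry behind a
lagging_state = dict(b.state)
b.catch_up(shared_log, upto=3)   # once b reaches the same position, states match
```

While `b` lags, its state differs from `a`'s, but the difference is fully explained by the two `applied` counters. Nothing in `catch_up` reads a clock or any input other than the log, which is what keeps the replicas deterministic.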

Typical usage patterns include:

Recording service requests in the log

Logging the state changes a service undergoes in response to requests

Recording a sequence of transformation commands

Logical logs record SQL statements (INSERT, UPDATE, DELETE) that cause changes, while physical logs record the actual row modifications.
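
The difference between the two kinds of log can be illustrated with Python's built-in `sqlite3` module. The `accounts` table, the statements, and the snapshot-diff trick used to find changed rows are all assumptions made for this sketch; real databases produce physical log entries inside the storage engine rather than by diffing.

```python
import sqlite3


def new_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    return conn


def snapshot(conn: sqlite3.Connection) -> dict:
    """Return the table contents as {id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


statements = [
    "INSERT INTO accounts (id, balance) VALUES (1, 100)",
    "INSERT INTO accounts (id, balance) VALUES (2, 50)",
    "UPDATE accounts SET balance = balance + 10",
    "DELETE FROM accounts WHERE id = 2",
]

primary = new_db()
logical_log = []   # the SQL text that caused each change
physical_log = []  # (id, new balance) for each changed row; None marks a deletion

for sql in statements:
    before = snapshot(primary)
    primary.execute(sql)
    after = snapshot(primary)
    logical_log.append(sql)
    for row_id in sorted(before.keys() | after.keys()):
        if before.get(row_id) != after.get(row_id):
            physical_log.append((row_id, after.get(row_id)))

# Replica 1 replays the statements; replica 2 applies the row contents.
logical_replica = new_db()
for sql in logical_log:
    logical_replica.execute(sql)

physical_replica = new_db()
for row_id, balance in physical_log:
    if balance is None:
        physical_replica.execute("DELETE FROM accounts WHERE id = ?", (row_id,))
    else:
        physical_replica.execute(
            "INSERT OR REPLACE INTO accounts (id, balance) VALUES (?, ?)",
            (row_id, balance),
        )
```

Both replicas end in the same state as the primary. Note that the single `UPDATE` statement is one logical entry but two physical entries, one per modified row.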

Two common replication models:

State‑machine model (active‑active)

Primary‑backup model (active‑passive)

In the active‑active model, the log holds the operations themselves, such as “+1” or “*2”, and each replica replays them to arrive at the same state. In the active‑passive model, a single primary executes the operations and logs the results; the other replicas apply those results in order.

Maintaining operation order is crucial; reordering leads to divergent results.
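
The two models, and the effect of reordering, can be compared in a short sketch. The function names and the starting value of 0 are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical operations taken from the "+1" and "*2" example.
OPS: Dict[str, Callable[[int], int]] = {
    "+1": lambda v: v + 1,
    "*2": lambda v: v * 2,
}


def replay_operations(op_log: List[str], start: int = 0) -> int:
    """Active-active: every replica executes each logged operation itself."""
    value = start
    for name in op_log:
        value = OPS[name](value)
    return value


def primary_results(op_log: List[str], start: int = 0) -> List[int]:
    """Active-passive: the primary executes operations and logs each result."""
    results, value = [], start
    for name in op_log:
        value = OPS[name](value)
        results.append(value)
    return results


def apply_results(result_log: List[int], start: int = 0) -> int:
    """A backup adopts the logged results in order without recomputing them."""
    value = start
    for result in result_log:
        value = result
    return value


op_log = ["+1", "*2", "+1"]
active_active = replay_operations(op_log)          # ((0 + 1) * 2) + 1
result_log = primary_results(op_log)               # what the primary would log
active_passive = apply_results(result_log)
reordered = replay_operations(["*2", "+1", "+1"])  # same operations, different order
```

Both models reach the same final value, while the reordered operation log does not, which is why the log must fix a single order for all replicas.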

A distributed log can be seen as the data structure that models the consensus problem, which is why it appears in algorithms such as Paxos, ZAB, Raft, and Viewstamped Replication.

2.2.4 Changelog

From a database perspective, a changelog of record modifications is dual to a table: logs can reconstruct a table’s state, and table changes can be recorded into a log.

This is the secret to near‑real‑time replication.

The concept mirrors version control: patches (logs) are applied to a branch snapshot (table) to reflect state changes.
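
The table/changelog duality can be sketched in both directions. The `Change` tuple format, `materialize`, and `LoggedTable` are hypothetical names introduced for this illustration.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None marks a deletion


def materialize(changelog: List[Change]) -> Dict[str, int]:
    """Log -> table: replay the changes in order to rebuild the current state."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


class LoggedTable:
    """Table -> log: every modification is also recorded as a changelog entry."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}
        self.changelog: List[Change] = []

    def put(self, key: str, value: int) -> None:
        self.changelog.append((key, value))
        self.rows[key] = value

    def delete(self, key: str) -> None:
        self.changelog.append((key, None))
        self.rows.pop(key, None)


table = LoggedTable()
table.put("alice", 10)
table.put("bob", 7)
table.put("alice", 12)
table.delete("bob")

rebuilt = materialize(table.changelog)
earlier = materialize(table.changelog[:2])  # state as of the first two changes
```

Replaying the full changelog reproduces the table, and replaying a prefix reproduces an earlier version of it, much as checking out an older commit does in version control.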
