
Understanding Database Write-Ahead Logs (WAL) and Their Implementation in etcd

This article explains common database logging mechanisms such as MySQL redo logs and binlogs, compares them with Redis AOF and etcd's Raft‑based WAL, and provides an in‑depth analysis of etcd's WAL source code, including key structures, creation process, record types, encoding, and file pipeline management.


Part 1: Common Database Logs

Traditional database logs such as MySQL's redo log record modifications to data. This is the write-ahead logging (WAL) technique: the log entry is written before the data itself is persisted to disk.

The redo log is specific to the InnoDB engine; the binlog is implemented at the MySQL server layer and is available to all engines. The redo log is a physical log: it records what changes were made to a data page.

The binlog is a logical log: it records the original logic of a statement, e.g. “increment field c of the row with ID=2 by 1”. The redo log is written in a circular fashion and has a fixed size.

The binlog is append-only; when a binlog file reaches a certain size, MySQL switches to a new file rather than overwriting previous logs.

Redis uses an Append Only File (AOF): after each write command succeeds, the command is appended to the file, so the dataset can be recovered by replaying it. The AOF stores each command Redis received as plain text.

etcd validates each proposal, and the leader persists proposals and log entries to a WAL file through the Raft module, guaranteeing consistency and recoverability.

Part 2: WAL Source Code Analysis

When the etcd server starts, it checks for a WAL directory to determine whether a previous WAL exists. If none exists, it calls <code>wal.Create</code> to create one; otherwise it calls <code>wal.Open</code> and <code>wal.ReadAll</code> to reload the existing WAL. The logic resides in the <code>NewServer</code> method in <code>etcd/etcdserver/server.go</code>.

Key WAL Objects

Important fields:

dir : directory where WAL files are stored.

dirFile : file descriptor for the opened directory.

metadata : byte sequence supplied when creating the WAL, typically containing node and cluster IDs; written to the WAL header.

state : the HardState saved during WAL appends; it is updated and flushed to disk whenever Raft's HardState changes, and is used after an etcd restart to recover the last HardState. HardState is defined as:

<code>type HardState struct {
    Term uint64 `protobuf:"varint,1,opt,name=term" json:"term"`
    Vote uint64 `protobuf:"varint,2,opt,name=vote" json:"vote"`
    Commit uint64 `protobuf:"varint,3,opt,name=commit" json:"commit"`
    XXX_unrecognized []byte `json:"-"`
}</code>

snapshot : metadata about the last snapshot stored in the WAL; defined as:

<code>type Snapshot struct {
    Index uint64 `protobuf:"varint,1,opt,name=index" json:"index"`
    Term uint64 `protobuf:"varint,2,opt,name=term" json:"term"`
    XXX_unrecognized []byte `json:"-"`
}</code>

decoder : deserializes protobuf records when reading WAL files.

readClose : closes the decoder's underlying reader after <code>ReadAll</code> completes.

enti : index of the last entry saved to the WAL.

encoder : serializes records before writing them to the WAL.

size : pre-allocation size for temporary files, 64 MB by default (defined by <code>wal.SegmentSizeBytes</code>).

locks : handles for all WAL file descriptors managed by the instance.

fp : filePipeline instance that creates new temporary files.
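Taken together, the fields above can be collected into a simplified Go struct. The types here are stand-ins: the real struct in etcd's wal package uses <code>raftpb.HardState</code>, <code>walpb.Snapshot</code>, an encoder/decoder pair, locked file handles, and a <code>*filePipeline</code>.

```go
package main

import "fmt"

// WAL is a simplified sketch of the struct the fields above describe.
// Types here are stand-ins for etcd's real raftpb/walpb/fileutil types.
type WAL struct {
	dir      string // directory where WAL files are stored
	metadata []byte // node/cluster IDs written to each file header
	enti     uint64 // index of the last entry saved to the WAL
	size     int64  // pre-allocation size for temporary files
}

func main() {
	w := WAL{dir: "/var/lib/etcd/member/wal", size: 64 * 1024 * 1024}
	fmt.Println(w.dir, w.size, w.enti)
}
```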

WAL Creation

The <code>wal.Create()</code> function performs several initialization steps:

Create a temporary directory and the first WAL file, named from sequence number 0 and first index 0.

Pre‑allocate disk space for the file.

Write a <code>crcType</code> record, a <code>metadataType</code> record, and an empty <code>snapshotType</code> record.

Create the <code>filePipeline</code> associated with the WAL.

Rename the temporary directory to the final WAL directory, making the initialization appear atomic.

<code>// Create creates a WAL ready for appending records. The given metadata is
// recorded at the head of each WAL file, and can be retrieved with ReadAll.
func Create(dirpath string, metadata []byte) (*WAL, error) {
    tmpdirpath := filepath.Clean(dirpath) + ".tmp"
    if fileutil.Exist(tmpdirpath) {
        if err := os.RemoveAll(tmpdirpath); err != nil {
            return nil, err
        }
    }
    if err := fileutil.CreateDirAll(tmpdirpath); err != nil {
        return nil, err
    }
    p := filepath.Join(tmpdirpath, walName(0, 0))
    f, err := fileutil.LockFile(p, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode)
    if err != nil {
        return nil, err
    }
    if _, err = f.Seek(0, io.SeekEnd); err != nil {
        return nil, err
    }
    if err = fileutil.Preallocate(f.File, SegmentSizeBytes, true); err != nil {
        return nil, err
    }
    w := &WAL{
        dir:      dirpath,
        metadata: metadata,
    }
    w.encoder, err = newFileEncoder(f.File, 0)
    if err != nil {
        return nil, err
    }
    w.locks = append(w.locks, f)
    if err = w.saveCrc(0); err != nil {
        return nil, err
    }
    if err = w.encoder.encode(&walpb.Record{Type: metadataType, Data: metadata}); err != nil {
        return nil, err
    }
    if err = w.SaveSnapshot(walpb.Snapshot{}); err != nil {
        return nil, err
    }
    if w, err = w.renameWal(tmpdirpath); err != nil {
        return nil, err
    }
    pdir, perr := fileutil.OpenDir(filepath.Dir(w.dir))
    if perr != nil {
        return nil, perr
    }
    if perr = fileutil.Fsync(pdir); perr != nil {
        return nil, perr
    }
    return w, nil
}</code>

WAL file names follow the pattern “seq-index.wal” and are generated by <code>walName(seq, index)</code>:

<code>func walName(seq, index uint64) string {
    return fmt.Sprintf("%016x-%016x.wal", seq, index)
}</code>
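As a quick check of the format, this standalone snippet reproduces <code>walName</code> and prints two sample names:

```go
package main

import "fmt"

// walName reproduces the naming helper shown above: sequence number
// and first index, each as 16 zero-padded hex digits.
func walName(seq, index uint64) string {
	return fmt.Sprintf("%016x-%016x.wal", seq, index)
}

func main() {
	fmt.Println(walName(0, 0))  // 0000000000000000-0000000000000000.wal
	fmt.Println(walName(1, 42)) // 0000000000000001-000000000000002a.wal
}
```

The fixed-width hex encoding keeps lexicographic order equal to numeric order, so sorting a directory listing yields the files in write order.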

Records are persisted as protobuf‑encoded frames. Each frame starts with a length field that may include padding information to ensure 8‑byte alignment.

<code>func encodeFrameSize(dataBytes int) (lenField uint64, padBytes int) {
    lenField = uint64(dataBytes)
    padBytes = (8 - (dataBytes % 8)) % 8
    if padBytes != 0 {
        lenField |= uint64(0x80|padBytes) << 56
    }
    return
}</code>
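The padding scheme is easier to see when the function is paired with a decoder. The <code>decodeFrameSize</code> below is a sketch of the reverse operation, mirroring the logic in etcd's decoder, which treats a set high bit as "padding present":

```go
package main

import "fmt"

// encodeFrameSize is the length-field encoding shown above: the low 56
// bits hold the record length; when padding is needed, the top byte
// stores 0x80|padBytes so a reader can skip the alignment bytes.
func encodeFrameSize(dataBytes int) (lenField uint64, padBytes int) {
	lenField = uint64(dataBytes)
	padBytes = (8 - (dataBytes % 8)) % 8
	if padBytes != 0 {
		lenField |= uint64(0x80|padBytes) << 56
	}
	return lenField, padBytes
}

// decodeFrameSize sketches the reverse operation: a set high bit marks
// a padded frame, and the low three bits of the top byte give the pad
// count. (etcd's decoder implements the same logic on int64 values.)
func decodeFrameSize(lenField uint64) (dataBytes, padBytes int) {
	dataBytes = int(lenField & 0x00FFFFFFFFFFFFFF)
	if lenField>>63 == 1 {
		padBytes = int((lenField >> 56) & 0x7)
	}
	return dataBytes, padBytes
}

func main() {
	lenField, pad := encodeFrameSize(13) // 13 % 8 = 5, so 3 pad bytes
	fmt.Println(pad)                     // 3
	fmt.Println(decodeFrameSize(lenField)) // 13 3
}
```

Because padding is at most 7 bytes, three bits suffice to store it, and the remaining 56 bits still allow records far larger than a 64 MB segment.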

(Figure omitted: an example layout of two WAL files on disk.)

filePipeline Type

The filePipeline uses an eager strategy, pre‑creating temporary files to speed up log file creation. It runs a background goroutine that generates “.tmp” files and provides them via a channel.

<code>type filePipeline struct {
    dir    string
    size   int64
    count  int
    filec  chan *fileutil.LockedFile
    errc   chan error
    donec  chan struct{}
}

func newFilePipeline(dir string, fileSize int64) *filePipeline {
    fp := &filePipeline{
        dir:   dir,
        size:  fileSize,
        filec: make(chan *fileutil.LockedFile),
        errc:  make(chan error, 1),
        donec: make(chan struct{}),
    }
    go fp.run()
    return fp
}

// Open returns a fresh file for writing.
func (fp *filePipeline) Open() (f *fileutil.LockedFile, err error) {
    select {
    case f = <-fp.filec:
    case err = <-fp.errc:
    }
    return
}

// Close stops the pipeline and removes the last temporary file.
func (fp *filePipeline) Close() error {
    close(fp.donec)
    return <-fp.errc
}

// alloc creates a temporary file with a rotating name.
func (fp *filePipeline) alloc() (f *fileutil.LockedFile, err error) {
    fpath := filepath.Join(fp.dir, fmt.Sprintf("%d.tmp", fp.count%2))
    f, err = fileutil.LockFile(fpath, os.O_CREATE|os.O_WRONLY, fileutil.PrivateFileMode)
    if err != nil {
        return nil, err
    }
    if err = fileutil.Preallocate(f.File, fp.size, true); err != nil {
        f.Close()
        return nil, err
    }
    fp.count++
    return f, nil
}

// run continuously creates temporary files until the pipeline is closed.
func (fp *filePipeline) run() {
    defer close(fp.errc)
    for {
        f, err := fp.alloc()
        if err != nil {
            fp.errc <- err
            return
        }
        select {
        case fp.filec <- f:
        case <-fp.donec:
            os.Remove(f.Name())
            f.Close()
            return
        }
    }
}</code>
Tags: Go, Raft, WAL, etcd, write-ahead log, database logs
Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
