Understanding Database Write-Ahead Logs (WAL) and Their Implementation in etcd
This article explains common database logging mechanisms such as MySQL redo logs and binlogs, compares them with Redis AOF and etcd's Raft‑based WAL, and provides an in‑depth analysis of etcd's WAL source code, including key structures, creation process, record types, encoding, and file pipeline management.
Part1 Common Database Logs
Traditional databases keep logs such as the redo log, which records modifications to data. This is the write-ahead logging (WAL) technique as used in MySQL: a change is written to the log first, and the data itself is persisted to disk afterwards.
The redo log is specific to the InnoDB engine, whereas the binlog is implemented at the MySQL server layer and is available to all engines.
The redo log is a physical log that records what change was made to which data page; the binlog is a logical log that records the original logic of the statement, e.g., “increment field c of the row with ID=2 by 1”.
The redo log has a fixed size and is written in a circular fashion; the binlog is append-only: when a binlog file reaches a certain size, writing switches to the next file and earlier logs are not overwritten.
Redis uses Append Only File (AOF), which saves each write command after it succeeds, allowing recovery by replaying the commands. AOF records each command received by Redis as plain text.
etcd validates each command first; the leader then hands it to the Raft module as a proposal, and the resulting log entries are stored in a WAL file to guarantee consistency and recoverability.
Part2 WAL Source Code Analysis
When the etcd server starts, it checks for a WAL directory to determine whether a previous WAL exists. If not, it calls wal.Create to create one; otherwise it uses wal.Open and wal.ReadAll to reload the existing WAL. The logic resides in etcd/etcdserver/server.go within the NewServer function.
Key WAL Objects
Important fields:
dir : directory where WAL files are stored.
dirFile : file descriptor for the opened directory.
metadata : byte sequence supplied when creating the WAL, typically containing node and cluster IDs; written to the WAL header.
state : hardState information saved during WAL appends; updated whenever Raft's hardState changes and flushed to disk.
hardState : used after etcd restarts to recover the last hardState; defined as:
<code>type HardState struct {
	Term             uint64 `protobuf:"varint,1,opt,name=term" json:"term"`
	Vote             uint64 `protobuf:"varint,2,opt,name=vote" json:"vote"`
	Commit           uint64 `protobuf:"varint,3,opt,name=commit" json:"commit"`
	XXX_unrecognized []byte `json:"-"`
}</code>
snapshot : metadata about the last snapshot stored in the WAL; defined as:
<code>type Snapshot struct {
	Index            uint64 `protobuf:"varint,1,opt,name=index" json:"index"`
	Term             uint64 `protobuf:"varint,2,opt,name=term" json:"term"`
	XXX_unrecognized []byte `json:"-"`
}</code>
decoder : deserializes protobuf records when reading WAL files.
readClose : closes the decoder’s reader after ReadAll.
enti : index of the last entry saved to the WAL.
encoder : serializes records before writing them to the WAL.
size : pre‑allocation size for temporary files, default 64 MB (defined by wal.SegmentSizeBytes).
locks : handles for all WAL file descriptors managed by the instance.
fp : filePipeline instance that creates new temporary files.
WAL Creation
The wal.Create() function performs several initialization steps:
Create a temporary directory and, inside it, the first WAL file. The file name encodes the sequence number and the first index, both 0 here, so walName(0, 0) yields “0000000000000000-0000000000000000.wal”.
Pre‑allocate disk space for the file.
Write a crcType record, a metadataType record, and a snapshotType record.
Create the filePipeline associated with the WAL.
Rename the temporary directory to the final WAL directory, making the initialization appear atomic.
<code>// Create creates a WAL ready for appending records. The given metadata is
// recorded at the head of each WAL file, and can be retrieved with ReadAll.
func Create(dirpath string, metadata []byte) (*WAL, error) {
	// Build the WAL in a temporary directory, removing any leftover
	// from an earlier, interrupted attempt.
	tmpdirpath := filepath.Clean(dirpath) + ".tmp"
	if fileutil.Exist(tmpdirpath) {
		if err := os.RemoveAll(tmpdirpath); err != nil {
			return nil, err
		}
	}
	if err := fileutil.CreateDirAll(tmpdirpath); err != nil {
		return nil, err
	}

	// Create and lock the first segment file, then pre-allocate its space.
	p := filepath.Join(tmpdirpath, walName(0, 0))
	f, err := fileutil.LockFile(p, os.O_WRONLY|os.O_CREATE, fileutil.PrivateFileMode)
	if err != nil {
		return nil, err
	}
	if _, err = f.Seek(0, io.SeekEnd); err != nil {
		return nil, err
	}
	if err = fileutil.Preallocate(f.File, SegmentSizeBytes, true); err != nil {
		return nil, err
	}

	w := &WAL{
		dir:      dirpath,
		metadata: metadata,
	}
	w.encoder, err = newFileEncoder(f.File, 0)
	if err != nil {
		return nil, err
	}
	w.locks = append(w.locks, f)

	// Write the three head records: CRC, metadata, and an empty snapshot.
	if err = w.saveCrc(0); err != nil {
		return nil, err
	}
	if err = w.encoder.encode(&walpb.Record{Type: metadataType, Data: metadata}); err != nil {
		return nil, err
	}
	if err = w.SaveSnapshot(walpb.Snapshot{}); err != nil {
		return nil, err
	}

	// Move the temporary directory to its final name (renameWal is not shown here).
	if w, err = w.renameWal(tmpdirpath); err != nil {
		return nil, err
	}

	// Fsync the parent directory so that the rename itself is durable.
	pdir, perr := fileutil.OpenDir(filepath.Dir(w.dir))
	if perr != nil {
		return nil, perr
	}
	if perr = fileutil.Fsync(pdir); perr != nil {
		return nil, perr
	}
	return w, nil
}</code>
WAL file names follow the pattern “seq-index.wal”, generated by walName(seq, index):
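<code>func walName(seq, index uint64) string {
	return fmt.Sprintf("%016x-%016x.wal", seq, index)
}</code>
Records are persisted as protobuf‑encoded frames. Each frame starts with a 64-bit length field: the low 56 bits hold the record length, and when that length is not a multiple of 8 the top byte carries a flag bit (0x80) together with the number of padding bytes needed for 8‑byte alignment.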
<code>func encodeFrameSize(dataBytes int) (lenField uint64, padBytes int) {
	lenField = uint64(dataBytes)
	// Pad the record up to the next multiple of 8 bytes.
	padBytes = (8 - (dataBytes % 8)) % 8
	if padBytes != 0 {
		// Store the padding flag (0x80) and the pad count in the top byte.
		lenField |= uint64(0x80|padBytes) << 56
	}
	return
}</code>
An example diagram shows two WAL files.
filePipeline Type
The filePipeline uses an eager strategy, pre‑creating temporary files to speed up log file creation. It runs a background goroutine that generates “.tmp” files and provides them via a channel.
<code>type filePipeline struct {
	dir   string // directory in which temporary files are created
	size  int64  // number of bytes to pre-allocate per file
	count int    // number of files allocated so far
	filec chan *fileutil.LockedFile
	errc  chan error
	donec chan struct{}
}

func newFilePipeline(dir string, fileSize int64) *filePipeline {
	fp := &filePipeline{
		dir:   dir,
		size:  fileSize,
		filec: make(chan *fileutil.LockedFile),
		errc:  make(chan error, 1),
		donec: make(chan struct{}),
	}
	go fp.run()
	return fp
}

// Open returns a fresh file for writing.
func (fp *filePipeline) Open() (f *fileutil.LockedFile, err error) {
	select {
	case f = <-fp.filec:
	case err = <-fp.errc:
	}
	return
}

// Close stops the pipeline and removes the last temporary file.
func (fp *filePipeline) Close() error {
	close(fp.donec)
	return <-fp.errc
}

// alloc creates a temporary file with a rotating name.
func (fp *filePipeline) alloc() (f *fileutil.LockedFile, err error) {
	// count%2 alternates between "0.tmp" and "1.tmp".
	fpath := filepath.Join(fp.dir, fmt.Sprintf("%d.tmp", fp.count%2))
	f, err = fileutil.LockFile(fpath, os.O_CREATE|os.O_WRONLY, fileutil.PrivateFileMode)
	if err != nil {
		return nil, err
	}
	if err = fileutil.Preallocate(f.File, fp.size, true); err != nil {
		f.Close()
		return nil, err
	}
	fp.count++
	return f, nil
}

// run continuously creates temporary files until the pipeline is closed.
func (fp *filePipeline) run() {
	defer close(fp.errc)
	for {
		f, err := fp.alloc()
		if err != nil {
			fp.errc <- err
			return
		}
		select {
		case fp.filec <- f: // handed over to a caller of Open
		case <-fp.donec: // pipeline closed: discard the unused file
			os.Remove(f.Name())
			f.Close()
			return
		}
	}
}</code>
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.