Databases 17 min read

Building a Mini LSM‑Tree Database: Theory, Design, and Code Walkthrough

This article explains the fundamentals of LSM‑Tree storage engines, introduces SSTable concepts, describes how to build and maintain ordered log files, and provides a complete Java implementation of a tiny key‑value store with code examples and design analysis.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
Building a Mini LSM‑Tree Database: Theory, Design, and Code Walkthrough

Introduction

LSM‑Tree is the underlying implementation of many NoSQL engines such as LevelDB and HBase. Based on the design ideas from Designing Data‑Intensive Applications , this article builds a mini database of about 500 lines of core code, combining theory and practice to help understand database principles.

1. SSTable (Sorted String Table)

Compared with a hash‑indexed database that stores the whole hash table in memory and has poor range‑query performance, an SSTable stores keys in order and ensures each key appears only once, bringing several advantages:

Merge of multiple log files becomes simple and efficient using a merge‑sort algorithm.

Key lookup does not require the entire key index in memory.

Supports range queries.

2. Building and Maintaining SSTable

Keys are kept ordered by first inserting them into an in‑memory red‑black tree (memtable). When the memtable exceeds a threshold it is flushed to disk as an SSTable. The basic workflow includes:

Write incoming keys to the memtable.

When the memtable size exceeds the threshold, write it to an SSTable file (the tree is already ordered).

A write‑ahead log (WAL) is used to recover from crashes; the WAL does not need to be ordered.

Read requests first check the memtable, then the immutable memtable, then SSTables from newest to oldest.

A background process periodically merges and compacts SSTables, discarding overwritten or deleted values.

3. Mini LSM‑Tree Implementation

The implementation is called TinyKvStore. Key design points include:

Memtable storage structure – store commands (set/rm) rather than raw values so that delete operations can be represented without disabling writes.

SSTable file format – consists of a data area, a sparse index area, and a file‑meta area. JSON is used for readability, while a production version would use compression.

Core code – persisting a memtable to an SSTable, reading meta‑information, and querying.

4. Code Walkthrough

Key methods for persisting a memtable to an SSTable:

/**
 * From memtable to SSTable
 * @param index
 */
private void initFromIndex(TreeMap<String, Command> index) {
    try {
        JSONObject partData = new JSONObject(true);
        tableMetaInfo.setDataStart(tableFile.getFilePointer());
        for (Command command : index.values()) {
            if (command instanceof SetCommand) {
                SetCommand set = (SetCommand) command;
                partData.put(set.getKey(), set);
            }
            if (command instanceof RmCommand) {
                RmCommand rm = (RmCommand) command;
                partData.put(rm.getKey(), rm);
            }
            if (partData.size() >= tableMetaInfo.getPartSize()) {
                writeDataPart(partData);
            }
        }
        if (partData.size() > 0) {
            writeDataPart(partData);
        }
        long dataPartLen = tableFile.getFilePointer() - tableMetaInfo.getDataStart();
        tableMetaInfo.setDataLen(dataPartLen);
        // save sparse index
        byte[] indexBytes = JSONObject.toJSONString(sparseIndex).getBytes(StandardCharsets.UTF_8);
        tableMetaInfo.setIndexStart(tableFile.getFilePointer());
        tableFile.write(indexBytes);
        tableMetaInfo.setIndexLen(indexBytes.length);
        // save file meta
        tableMetaInfo.writeToFile(tableFile);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

Reading meta‑information (written in reverse order):

public static TableMetaInfo readFromFile(RandomAccessFile file) {
    try {
        TableMetaInfo tableMetaInfo = new TableMetaInfo();
        long fileLen = file.length();
        file.seek(fileLen - 8);
        tableMetaInfo.setVersion(file.readLong());
        file.seek(fileLen - 8 * 2);
        tableMetaInfo.setIndexLen(file.readLong());
        file.seek(fileLen - 8 * 3);
        tableMetaInfo.setIndexStart(file.readLong());
        file.seek(fileLen - 8 * 4);
        tableMetaInfo.setDataLen(file.readLong());
        file.seek(fileLen - 8 * 5);
        tableMetaInfo.setDataStart(file.readLong());
        file.seek(fileLen - 8 * 6);
        tableMetaInfo.setPartSize(file.readLong());
        return tableMetaInfo;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

Querying a key from an SSTable:

public Command query(String key) {
    try {
        LinkedList<Position> sparseKeyPositionList = new LinkedList<>();
        Position lastSmall = null, firstBig = null;
        for (String k : sparseIndex.keySet()) {
            if (k.compareTo(key) <= 0) {
                lastSmall = sparseIndex.get(k);
            } else {
                firstBig = sparseIndex.get(k);
                break;
            }
        }
        if (lastSmall != null) sparseKeyPositionList.add(lastSmall);
        if (firstBig != null) sparseKeyPositionList.add(firstBig);
        if (sparseKeyPositionList.isEmpty()) return null;
        Position first = sparseKeyPositionList.getFirst();
        Position last = sparseKeyPositionList.getLast();
        long start = first.getStart();
        long len = (first.equals(last)) ? first.getLen() : last.getStart() + last.getLen() - start;
        byte[] dataPart = new byte[(int) len];
        tableFile.seek(start);
        tableFile.read(dataPart);
        int pStart = 0;
        for (Position pos : sparseKeyPositionList) {
            JSONObject partJson = JSONObject.parseObject(new String(dataPart, pStart, (int) pos.getLen()));
            if (partJson.containsKey(key)) {
                JSONObject value = partJson.getJSONObject(key);
                return ConvertUtil.jsonToCommand(value);
            }
            pStart += (int) pos.getLen();
        }
        return null;
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

Conclusion

Implementing a small LSM‑Tree database clarifies why the design choices—ordered logs, immutable SSTables, WAL, and background compaction—are made. The article also suggests further improvements such as asynchronous persistence, log compression, and Bloom‑filter‑based query pruning.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaLSM‑TreeSSTableDatabase Internalswrite-ahead logkey-value store
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.