Building a Mini LSM‑Tree Database: Theory, Design, and Code Walkthrough
This article explains the fundamentals of LSM‑Tree storage engines, introduces SSTable concepts, describes how to build and maintain ordered log files, and provides a complete Java implementation of a tiny key‑value store with code examples and design analysis.
Introduction
LSM‑Tree is the underlying implementation of many NoSQL engines such as LevelDB and HBase. Based on the design ideas from Designing Data‑Intensive Applications , this article builds a mini database of about 500 lines of core code, combining theory and practice to help understand database principles.
1. SSTable (Sorted String Table)
Compared with a hash‑indexed database that stores the whole hash table in memory and has poor range‑query performance, an SSTable stores keys in order and ensures each key appears only once, bringing several advantages:
Merge of multiple log files becomes simple and efficient using a merge‑sort algorithm.
Key lookup does not require the entire key index in memory.
Supports range queries.
2. Building and Maintaining SSTable
Keys are kept ordered by first inserting them into an in‑memory red‑black tree (memtable). When the memtable exceeds a threshold it is flushed to disk as an SSTable. The basic workflow includes:
Write incoming keys to the memtable.
When the memtable size exceeds the threshold, write it to an SSTable file (the tree is already ordered).
A write‑ahead log (WAL) is used to recover from crashes; the WAL does not need to be ordered.
Read requests first check the memtable, then the immutable memtable, then SSTables from newest to oldest.
A background process periodically merges and compacts SSTables, discarding overwritten or deleted values.
3. Mini LSM‑Tree Implementation
The implementation is called TinyKvStore. Key design points include:
Memtable storage structure – store commands (set/rm) rather than raw values so that delete operations can be represented without disabling writes.
SSTable file format – consists of a data area, a sparse index area, and a file‑meta area. JSON is used for readability, while a production version would use compression.
Core code – persisting a memtable to an SSTable, reading meta‑information, and querying.
4. Code Walkthrough
Key methods for persisting a memtable to an SSTable:
/**
* From memtable to SSTable
* @param index
*/
private void initFromIndex(TreeMap<String, Command> index) {
try {
JSONObject partData = new JSONObject(true);
tableMetaInfo.setDataStart(tableFile.getFilePointer());
for (Command command : index.values()) {
if (command instanceof SetCommand) {
SetCommand set = (SetCommand) command;
partData.put(set.getKey(), set);
}
if (command instanceof RmCommand) {
RmCommand rm = (RmCommand) command;
partData.put(rm.getKey(), rm);
}
if (partData.size() >= tableMetaInfo.getPartSize()) {
writeDataPart(partData);
}
}
if (partData.size() > 0) {
writeDataPart(partData);
}
long dataPartLen = tableFile.getFilePointer() - tableMetaInfo.getDataStart();
tableMetaInfo.setDataLen(dataPartLen);
// save sparse index
byte[] indexBytes = JSONObject.toJSONString(sparseIndex).getBytes(StandardCharsets.UTF_8);
tableMetaInfo.setIndexStart(tableFile.getFilePointer());
tableFile.write(indexBytes);
tableMetaInfo.setIndexLen(indexBytes.length);
// save file meta
tableMetaInfo.writeToFile(tableFile);
} catch (Throwable t) {
throw new RuntimeException(t);
}
}Reading meta‑information (written in reverse order):
public static TableMetaInfo readFromFile(RandomAccessFile file) {
try {
TableMetaInfo tableMetaInfo = new TableMetaInfo();
long fileLen = file.length();
file.seek(fileLen - 8);
tableMetaInfo.setVersion(file.readLong());
file.seek(fileLen - 8 * 2);
tableMetaInfo.setIndexLen(file.readLong());
file.seek(fileLen - 8 * 3);
tableMetaInfo.setIndexStart(file.readLong());
file.seek(fileLen - 8 * 4);
tableMetaInfo.setDataLen(file.readLong());
file.seek(fileLen - 8 * 5);
tableMetaInfo.setDataStart(file.readLong());
file.seek(fileLen - 8 * 6);
tableMetaInfo.setPartSize(file.readLong());
return tableMetaInfo;
} catch (Throwable t) {
throw new RuntimeException(t);
}
}Querying a key from an SSTable:
public Command query(String key) {
try {
LinkedList<Position> sparseKeyPositionList = new LinkedList<>();
Position lastSmall = null, firstBig = null;
for (String k : sparseIndex.keySet()) {
if (k.compareTo(key) <= 0) {
lastSmall = sparseIndex.get(k);
} else {
firstBig = sparseIndex.get(k);
break;
}
}
if (lastSmall != null) sparseKeyPositionList.add(lastSmall);
if (firstBig != null) sparseKeyPositionList.add(firstBig);
if (sparseKeyPositionList.isEmpty()) return null;
Position first = sparseKeyPositionList.getFirst();
Position last = sparseKeyPositionList.getLast();
long start = first.getStart();
long len = (first.equals(last)) ? first.getLen() : last.getStart() + last.getLen() - start;
byte[] dataPart = new byte[(int) len];
tableFile.seek(start);
tableFile.read(dataPart);
int pStart = 0;
for (Position pos : sparseKeyPositionList) {
JSONObject partJson = JSONObject.parseObject(new String(dataPart, pStart, (int) pos.getLen()));
if (partJson.containsKey(key)) {
JSONObject value = partJson.getJSONObject(key);
return ConvertUtil.jsonToCommand(value);
}
pStart += (int) pos.getLen();
}
return null;
} catch (Throwable t) {
throw new RuntimeException(t);
}
}Conclusion
Implementing a small LSM‑Tree database clarifies why the design choices—ordered logs, immutable SSTables, WAL, and background compaction—are made. The article also suggests further improvements such as asynchronous persistence, log compression, and Bloom‑filter‑based query pruning.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
