How I Built a High‑Performance Java KV Store for a Competition – Architecture and Optimizations
This article details the design, implementation, and performance tuning of a Java‑based key‑value storage engine built for a competition, covering the competition requirements, interface definitions, data layout, recovery logic, multi‑threaded read/write strategies, Direct I/O techniques, JVM tuning, and practical lessons learned.
Introduction
I participated in a KV‑store competition using Java, ranked 21st, and in this article I share the overall architecture and Java‑specific best practices for building a high‑performance storage engine.
Competition Overview
The contest consists of a preliminary and a final round. Participants must implement a simplified, efficient KV store supporting write(byte[] key, byte[] value), read(byte[] key), and in the final round an additional range(byte[] lower, byte[] upper, AbstractVisitor visitor) interface.
Evaluation Criteria
Two phases are evaluated:
Correctness: the engine must survive repeated kill -9 crashes and recover without data loss.
Performance: random writes, random reads, and sequential (range) reads are measured with 64 concurrent threads.
Key and Value Design
Keys are fixed at 8 bytes, represented as long. Values are 4 KB, which aligns well with disk I/O and allows the logical offset to be stored as an int (offset = physicalOffset / 4096), halving memory usage.
Handling kill‑9
Since the test forces process termination, we use the OS page cache as a write buffer and avoid asynchronous I/O. For durability we mmap a region and flush it in 4‑value batches; this approach is faster than writing 4 KB blocks directly.
Hidden Evaluation Details
After each phase the test script clears the page cache (via sysctl -w vm.drop_caches=1/2/3) and counts that time toward the final score, making excessive page‑cache usage a penalty and highlighting Direct I/O as a “silver bullet”.
Key Distribution and Partitioning
The 64‑million keys are uniformly distributed. By hashing the high 10 bits of the key we obtain 1024 partitions, each stored in its own file. This enables converting random writes into sequential writes within a partition.
public class HighTenPartitioner implements Partitionable {
@Override
public int getPartition(byte[] key) {
return ((key[0] & 0xff) << 2) | ((key[1] & 0xff) >> 6));
}
}Architecture Overview
Three perspectives:
Global view : 1024 partitions, each containing an index file and a data file.
Partition view : Within a partition, writes are appended sequentially; the index maps keys to logical offsets.
Memory view : Two in‑memory arrays hold key[1024][625000] and offset[1024][625000].
Random Write Flow
We allocate a write buffer per partition using mmap, batch four values before flushing, and store the index via mmap as well. No in‑memory index is maintained during the write phase; the recover phase reconstructs offsets from key order.
ByteBuffer writeBuffer = ByteBuffer.allocateDirect(4 * 4096);
// fill buffer, then force to diskRecover Flow
On engine startup we open the data files, load keys and offsets into the two arrays, and sort each partition using 64 threads. Sorting is performed on separate key and offset arrays to avoid struct overhead and improve CPU cache utilization.
public class KeyOffset {
long key;
int offset;
}Random Read Flow
For a given key we locate its partition, binary‑search the sorted key array to obtain the offset, then read the value directly from the data file.
Sequential (Range) Read Flow
Range is the performance bottleneck. We use a producer‑consumer model: four fetch threads load entire partitions into off‑heap buffers (≈256 MB each), while 64 visit threads concurrently consume the buffers and invoke the visitor callback. This design maximizes disk‑to‑memory throughput and keeps CPU busy.
public void acquire() throws InterruptedException {
while (!permit) {
Thread.sleep(0, 1);
}
permit = false;
}Direct I/O in Java
Java lacks native Direct I/O, so we wrap the required syscalls via JNA, following the jaydio approach. Two crucial points: allocate aligned memory (via posix_memalign) and ensure writes are page‑size multiples using pread with O_DIRECT.
Direct Memory vs Heap Memory
Off‑heap buffers avoid extra copies and reduce GC pressure, especially for the large 256 MB range buffers.
JVM Tuning
-server -Xms2560m -Xmx2560m -XX:MaxDirectMemorySize=1024m -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:-UseBiasedLockingIncreasing the old generation (NewRatio=4) cuts CMS GC cycles from 126 to 5, significantly improving latency.
Object Pooling
Instead of heavyweight pools we use ThreadLocal caches for frequently allocated objects (e.g., key/value wrappers), eliminating per‑iteration allocations.
Reducing Thread Switching
A custom semaphore with busy‑wait and micro‑sleep reduces contention between the 4 I/O threads and the 64 visit threads, yielding a ~6 s performance gain.
Core Binding
We pin the four I/O threads to dedicated CPU cores using the OpenHFT Affinity library, preventing scheduler interference and adding 1–2 s speedup.
FileChannel, MMAP, and Direct I/O Discussion
While FileChannel leverages the page cache, Direct I/O can be combined with limited page‑cache usage for consistency. The insights from this competition suggest that a hybrid approach can benefit production storage engines.
Conclusion
Using Java, I achieved a final score only ~10 s slower than C++ competitors across 64 million random writes, reads, and sequential reads, demonstrating that with careful architecture, Direct I/O, and JVM tuning, Java can approach native performance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
