Designing Structured Table and Index Storage in RocksDB for NewSQL Systems
This article explains how to map relational table rows and various indexes—including primary key, integer, float, string, and composite indexes—into RocksDB key‑value pairs using custom serialization, big‑endian ordering, and encoding functions to enable efficient point, range, and multi‑field queries.
NewSQL systems have increasingly adopted RocksDB as the core storage engine for both table data and indexes. Since RocksDB stores data as ordered byte‑array keys and values, the challenge is to convert structured rows and index columns into suitable key‑value representations.
RocksDB provides random get/put operations as well as iterators that can seek to a starting key and scan forward, enabling both point lookups and range scans when the data is stored in a monotonic byte order.
Primary Key Index
Each row is stored with the id field as the key and the remaining columns serialized (e.g., via a custom protocol or protobuf) as the value. This allows fast Get or MGet for queries like WHERE id = 101 or WHERE id IN (101,105,108). For range queries ( WHERE id > -100 AND id < 200) an iterator is created that seeks to the first qualifying key and scans forward.
To keep the byte order consistent with integer ordering, the most‑significant bit of non‑negative integers is flipped to 1 and that of negative integers to 0, ensuring that the binary order matches the numeric order.
var N uint16 = 0x8000
func encodeInt16(i int16) []byte {
u := uint16(i) + N
buf := make([]byte, 2)
binary.BigEndian.PutUint16(buf, u)
return buf
}
func decodeToInt16(buf []byte) (int16, error) {
if len(buf) < 2 {
return 0, fmt.Errorf("invalid buf")
}
u := binary.BigEndian.Uint16(buf)
i := int16(u - N)
return i, nil
}Integer Field Index (i_index)
For a non‑unique integer index, the key consists of the encoded i field followed by the encoded primary key id. This ordering guarantees that rows are first sorted by i and then by id. Range queries on i are handled similarly to the primary key.
If the integer index were unique, only the i field would be stored in the key and the id would be placed in the value.
Float Field Index (f_index)
Floating‑point values require special handling because their binary representation differs from integers. By flipping the sign bit for non‑negative numbers (setting it to 1) and inverting all bits for negative numbers, the resulting big‑endian byte sequence preserves the natural ordering of the float values.
func encodeFloat(f float32) []byte {
u := *(*uint32)(unsafe.Pointer(&f))
if f >= 0 {
u |= 0x80000000
} else {
u = ^u
}
buf := make([]byte, 4)
binary.BigEndian.PutUint32(buf, u)
return buf
}
func decodeToFloat(buf []byte) (float32, error) {
if len(buf) < 2 {
return 0, fmt.Errorf("invalid buf")
}
u := binary.BigEndian.Uint32(buf)
if u >= 0x80000000 {
u += 0x80000000
} else {
u = ^u
}
f := *(*float32)(unsafe.Pointer(&u))
return f, nil
}The f_index key is built from the encoded 4‑byte float followed by the encoded 2‑byte integer id, with an empty value.
String Field Index (c_index)
Variable‑length strings are padded to 8‑byte groups, each group receiving a trailing length‑indicator byte (255 minus the number of padding zeros). After processing the string, the encoded id (with its most‑significant bit flipped) is appended. This scheme prevents the id bytes from affecting the lexical order of short versus long strings.
[] -> [0,0,0,0,0,0,0,0,247] (empty string)
[1,2,3] -> [1,2,3,0,0,0,0,0,250] ("123")
[1,2,3,0] -> [1,2,3,0,0,0,0,0,251] ("123\0")
[1,2,3,4,5,6,7,8] -> [1,2,3,4,5,6,7,8,255,0,0,0,0,0,0,0,0,247]
[1,2,3,4,5,6,7,8,9] -> [1,2,3,4,5,6,7,8,255,9,0,0,0,0,0,0,0,248]Composite Indexes
For multi‑column indexes such as i_f_index (i, f) and i_c_f_index (i, c, f), the key is formed by concatenating the encoded components in the order defined by the index, each using the appropriate encoding (integer, float, string, then primary key). The value remains empty.
Conclusion
By encoding primary keys and secondary indexes into RocksDB keys with big‑endian ordering and type‑specific transformations, a NewSQL system can achieve fast point lookups, range scans, and multi‑column index queries similar to traditional MySQL engines. Further performance tuning involves query planning based on statistics, histograms, and structures like Count‑Min Sketch, as well as deep understanding of RocksDB’s get, mget, seek, and scan operations.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
