Performance Optimization of TXRocks Sum Operation: Pushdown, Record Conversion, and Multithreaded Concurrency
This article analyzes the high space efficiency but poor sum performance of TXRocks compared to InnoDB, identifies three main bottlenecks, and details three optimizations—sum push‑down, selective record conversion, and multithreaded execution—that together reduce query latency to less than 5% of InnoDB’s original time.
TXRocks is the TXSQL adaptation of RocksDB, heavily customized from Facebook’s open‑source MySQL branch. Compared with the commonly used InnoDB engine, RocksDB consumes roughly half the space of compressed InnoDB because its data is stored in compressed, append‑only SST files and avoids the free space that InnoDB’s B+‑tree pages reserve for future inserts. In an internal project with highly redundant numeric data, TXRocks occupied only one‑tenth of the space of compressed InnoDB, but its SUM query performance was only about 60% of InnoDB’s (InnoDB 38.29 s vs. TXRocks 62.8 s).
Root‑cause analysis revealed three main reasons for the poor performance: (1) the server layer iterates records one by one, incurring high traversal cost; (2) the engine layer parses every column of each record even though SUM only needs a few columns; (3) the server processes aggregation in a single thread, limiting concurrency. The optimization plan therefore focused on three aspects: pushing the SUM operation down to the storage engine, reducing per‑record parsing overhead, and introducing multithreaded aggregation.
Problem Diagnosis
Flame‑graph profiling showed that most CPU time was spent in three functions: myrocks::ha_rocksdb::convert_record_from_storage_format, rocksdb::DBIter::Next, and the call chain from sub_select to rnd_next_with_direction. The top command confirmed that the MySQL daemon was using a single CPU core at 100%.
The design decisions taken in response to these hotspots were:
Only decode the columns required for SUM instead of all columns.
Leave the RocksDB iterator behavior unchanged.
Push the SUM calculation down to the engine layer.
Enable multithreaded aggregation.
Optimization 1: SUM Push‑Down
The SUM operation was moved from the SQL layer to the engine layer for integer columns. The SQL side sets a flag when the push‑down condition is met; the engine’s rnd_next checks this flag and performs aggregation directly while iterating.
Push‑down conditions:
Queries with ORDER BY, GROUP BY, HAVING, WHERE, or involving multiple tables are not pushed down.
Non‑SUM/COUNT aggregates are not pushed down.
Aggregates involving multiple fields are not pushed down.
Engine‑side handling (excerpt):
if (agg_sum_push_down) {
  /* 1. Iterate over the whole table */
  for (;;) {
    ...
    scan_it->Next();
    /* 2. End of iteration */
    if (!scan_it->Valid()) { rc = HA_ERR_END_OF_FILE; break; }
    ...
    /* 3. Decode only needed fields */
    rc = convert_needed_field_from_value(&value, value, field_is_null);
    ...
    /* 4. Check overflow */
    if (agg_sum_is_overflow(local_sum, value)) { rc = HA_ERR_ROCKSDB_STATUS_TRY_AGAIN; break; }
    /* 5. Accumulate */
    local_sum += value;
  }
} else {
  /* Normal row fetch */
}

Optimization 2: Selective Record Conversion
The function convert_record_from_storage_format originally decoded every column of a record. Since the business scenario only required a single column for SUM, the code was modified to decode only the needed column, dramatically reducing CPU work.
/* Iterate over the per-column decoders; `decode` is true only for the
   columns the query actually needs, so the others are skipped cheaply. */
for (auto it = m_decoders_vect.begin(); it != m_decoders_vect.end(); ++it) {
  const Rdb_field_encoder *field_dec = it->m_field_enc;
  Field *field = table->field[field_dec->m_field_index];
  if (isNull) {
    if (decode) {
      field->set_null();
      memcpy(field->ptr, table->s->default_values + field_offset,
             field->pack_length());
    }
  } else {
    if (decode) field->set_notnull();
    if (field_dec->m_field_type == MYSQL_TYPE_BLOB) {
      err = convert_blob_from_storage_format(
          (my_core::Field_blob *)field, &reader, decode);
    } else if (field_dec->m_field_type == MYSQL_TYPE_VARCHAR) {
      err = convert_varchar_from_storage_format(
          (my_core::Field_varstring *)field, &reader, decode);
    } else {
      err = convert_field_from_storage_format(
          field, &reader, decode, field_dec->m_pack_length_in_rec);
    }
  }
}

Optimization 3: Multithreaded Concurrency
To exploit multiple CPU cores, the data set was split among threads. Two splitting strategies were considered: static equal‑size ranges and histogram‑based ranges derived from RocksDB’s VersionStorageInfo. The histogram approach, which selects the level with the most records and partitions its files based on record count, yielded more balanced workloads.
1. Retrieve storage_info;
2. Determine the number of LSM levels;
3. Choose the level for splitting:
a) If only level 0 exists, do not split (single‑thread);
b) If a single non‑zero level exists, use it;
c) Otherwise, pick the level with the most records;
4. Compute available CPU threads;
5. Distribute records of the chosen level evenly across threads;
6. Split files of that level into thread‑specific ranges.

Optimization Results
After applying the three optimizations, the SUM query latency dropped to 1.74 s, less than 5% of the original InnoDB latency. Updated perf graphs and latency charts (shown in the original article) confirm the dramatic reduction in CPU time and overall query duration.
Conclusion
The article demonstrates that by pushing aggregation down to the engine, decoding only the required columns, and employing multithreaded execution, TXRocks’ SUM latency can be cut from roughly 1.6× InnoDB’s (62.8 s vs. 38.29 s) to under 5% of it (1.74 s). The same techniques have also been applied to COUNT operations, and further optimizations are planned.
Tencent Database Technology
Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.