How Valkey 8.0 Boosts Single-Node Performance with Async IO, Prefetch, and MAA
Valkey 8.0 introduces asynchronous IO threads, data prefetch, and memory access amortization (MAA). Together these offload much of the work from the main thread, raising single‑node throughput to roughly 1 million requests per second while also improving memory utilization and overall system efficiency.
Background
In September 2024 the Valkey community released Valkey 8.0. Building on the earlier article "Redis is single‑threaded?", this version adds asynchronous IO threads, data prefetch, and memory‑access amortization (MAA), raising single‑node request capacity from ~200 k/s to over 1 M/s.
Async IO Thread Background
Redis 6.0 introduced multithreaded IO for network reads/writes and protocol parsing, but the main thread still waited for every IO thread to finish, capping performance: IO threads sat idle while the main thread blocked, and the main thread still carried most of the IO work itself.
Redis 6.0 Multithreaded IO
Read data flow: the main thread queues readable clients, then distributes them to IO threads via a round‑robin algorithm. Write flow is similar. Although performance doubled, the design left the main thread as a bottleneck.
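The round‑robin fan‑out described above can be sketched as follows; `IO_THREADS` and the per‑thread assignment function are illustrative assumptions, not Redis's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* The main thread walks its list of pending clients and hands the
 * i-th client to IO thread (i mod N). The thread count here is an
 * illustrative constant. */
#define IO_THREADS 4

/* Returns the index of the IO thread assigned to the i-th pending client. */
static int assign_io_thread(size_t client_index) {
    return (int)(client_index % IO_THREADS);
}
```

Because assignment depends only on the position in the pending list, clients spread evenly across threads, but the main thread must still wait for every thread to drain its share before proceeding.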
Valkey 8.0 Async IO Thread
Valkey creates a static lock‑free ring buffer (size 2048) for each IO thread as a task queue. When the server starts, server.io_threads_num determines the number of IO threads (max 15). Tasks such as read/write, event loop handling, and object memory release are offloaded to these threads.
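A single‑producer single‑consumer lock‑free ring buffer in the spirit of these per‑IO‑thread task queues can be sketched as below; the names and layout are illustrative, not Valkey's actual structures:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Fixed-size SPSC ring: main thread pushes tasks, one IO thread pops.
 * A power-of-two size makes index wrap a cheap bit mask. */
#define RING_SIZE 2048

typedef struct {
    void *items[RING_SIZE];
    _Atomic size_t head;  /* advanced by the consumer (IO thread) */
    _Atomic size_t tail;  /* advanced by the producer (main thread) */
} ring_buffer;

/* Main thread enqueues a task; returns false if the ring is full. */
static bool ring_push(ring_buffer *rb, void *task) {
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    if (tail - head == RING_SIZE) return false;           /* full */
    rb->items[tail & (RING_SIZE - 1)] = task;
    atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
    return true;
}

/* IO thread dequeues a task; returns NULL if the ring is empty. */
static void *ring_pop(ring_buffer *rb) {
    size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (head == tail) return NULL;                        /* empty */
    void *task = rb->items[head & (RING_SIZE - 1)];
    atomic_store_explicit(&rb->head, head + 1, memory_order_release);
    return task;
}
```

With exactly one producer and one consumer, a release store on one index paired with an acquire load on the other is enough to hand tasks across threads without locks.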
IO threads dynamically adjust their count based on pending read/write events, never exceeding the configured maximum.
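A hypothetical scaling rule in the spirit of this dynamic adjustment: activate roughly one IO thread per fixed slice of pending events, never exceeding the configured maximum. The constant below is illustrative, not Valkey's actual tuning:

```c
#include <assert.h>

/* Assumed events-per-thread budget; purely illustrative. */
#define EVENTS_PER_THREAD 64

/* Decide how many IO threads should be active given the number of
 * pending read/write events, clamped to the configured maximum. */
static int active_io_threads(int pending_events, int max_threads) {
    int wanted = 1 + pending_events / EVENTS_PER_THREAD;
    return wanted < max_threads ? wanted : max_threads;
}
```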
Offloading More Tasks to IO Threads
Beyond network IO, Valkey 8.0 offloads event polling (epoll_wait), object deallocation, command lookup, and other time‑consuming actions to IO threads, reducing main‑thread load.
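One way to picture this offloading is a generic task descriptor the main thread hands to an IO thread's queue instead of doing the work inline; the types and names here are assumptions, not Valkey's actual structures:

```c
#include <assert.h>
#include <stddef.h>

/* A deferred job: a function pointer plus its argument. The main thread
 * packages work (e.g. freeing a large object) and enqueues it; the IO
 * thread runs it later. */
typedef void (*task_fn)(void *arg);
typedef struct { task_fn fn; void *arg; } io_task;

static int freed_objects = 0;

/* Stand-in for releasing an object's memory on an IO thread. */
static void deferred_free(void *arg) { (void)arg; freed_objects++; }

/* What an IO thread does when it pops a task from its queue. */
static void run_task(io_task *t) { t->fn(t->arg); }
```

The main thread's cost drops to building the descriptor and one queue push, while the expensive `free()` traversal happens off the critical path.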
Data Prefetch
Prefetching uses __builtin_prefetch() to load command arguments, keys, and values into the CPU cache before execution, mitigating the memory‑CPU speed gap.
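A minimal sketch of the builtin in use, hedged as an illustration: the summing loop below stands in for command execution, and the lookahead distance of 8 is an arbitrary choice, not Valkey's:

```c
#include <assert.h>
#include <stddef.h>

/* Walk an array while asking the CPU to pull upcoming elements into
 * cache ahead of the dependent read. __builtin_prefetch is a GCC/Clang
 * builtin: (address, rw: 0=read, locality: 0..3). */
static long sum_with_prefetch(const long *vals, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&vals[i + 8], 0, 1);
        total += vals[i];
    }
    return total;
}
```

The builtin is a hint, not a guarantee: it never changes results, only (potentially) how long the dependent loads stall.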
Memory Access Amortization (MAA)
MAA interleaves memory accesses of multiple keys: while one key waits for memory, another key’s data is prefetched, allowing parallel memory operations and reducing average latency.
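The idea can be sketched with a toy hash table: issue a prefetch for every key's bucket first, then do the dependent reads, so the cache misses overlap instead of serializing. The structures below are illustrative, not Valkey's dict internals:

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 1024

/* Toy open-addressed table: one slot per bucket, keyed by hash mod. */
typedef struct { unsigned long key; long val; int used; } bucket;

/* Look up a batch of keys with amortized memory access. */
static long lookup_batch(const bucket table[NBUCKETS],
                         const unsigned long *keys, size_t n) {
    /* Phase 1: prefetch each key's bucket so all misses are in flight. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&table[keys[i] % NBUCKETS], 0, 1);
    /* Phase 2: dependent reads now mostly hit the cache. */
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        const bucket *b = &table[keys[i] % NBUCKETS];
        if (b->used && b->key == keys[i]) total += b->val;
    }
    return total;
}
```

Compared with looking keys up one at a time, the two‑phase shape turns N sequential memory stalls into roughly one overlapped stall for the whole batch.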
Prefetch and MAA in Practice
Valkey batches up to 16 commands per client, prefetches dictionary entries and values for all keys in the batch, and executes them after all prefetches complete. This interleaved, batched approach raises single‑node QPS to around 1.2 M.
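The batched flow can be sketched as a two‑phase loop; the stubs below only record the order of operations to show that every prefetch precedes every execution, and all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BATCH 16

/* Trace of operations: 'P' for a prefetch, 'E' for an execution. */
static char trace[2 * MAX_BATCH + 1];
static size_t trace_len = 0;

static void prefetch_cmd(int i) { (void)i; trace[trace_len++] = 'P'; }
static void exec_cmd(int i)     { (void)i; trace[trace_len++] = 'E'; }

/* Two-phase batch: prefetch the key structures for every command in
 * the batch, then execute them, so the memory loads for the whole
 * batch overlap before any command runs. */
static void run_batch(int ncmds) {
    if (ncmds > MAX_BATCH) ncmds = MAX_BATCH;
    for (int i = 0; i < ncmds; i++) prefetch_cmd(i);
    for (int i = 0; i < ncmds; i++) exec_cmd(i);
    trace[trace_len] = '\0';
}
```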
Summary
By introducing asynchronous IO threads, data prefetch, and MAA, Valkey 8.0 dramatically improves single‑node performance, memory utilization, replication efficiency, and overall system robustness, making it competitive with larger Redis clusters.
DeWu Technology
