
Design and Implementation of the New Tokio Scheduler: Performance Improvements and Optimizations

This article details the redesign of Tokio's Rust async scheduler, describing its new task system, queue algorithms, task‑stealing strategy, reduced synchronization, memory‑allocation optimizations, Loom‑based concurrency testing, and the resulting ten‑fold performance gains in benchmarks and real‑world workloads.


Tokio is Rust's asynchronous runtime, and a completely rewritten scheduler has been released that brings massive performance improvements—up to ten times faster in some micro‑benchmarks and significant gains for real‑world workloads such as Hyper and Tonic.

How the Scheduler Works

A scheduler distributes work units called tasks. Each task is either runnable or blocked, and the scheduler repeatedly pulls runnable tasks from a queue and runs them on OS threads. Tokio runs Rust Futures, which act as green threads in an M:N model, multiplexing many user-space tasks onto a few system threads.

The core abstraction can be expressed as:

while let Some(task) = self.queue.pop() {
    task.run();
}

When a task becomes runnable it is inserted into the queue.
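The loop above can be fleshed out into a toy single-threaded scheduler. This is a sketch for intuition only: Scheduler, Task, spawn, and run are illustrative names, not Tokio's API, and a "task" here is just a boxed closure rather than a suspendable Future.

```rust
use std::collections::VecDeque;

// Illustrative only: real Tokio tasks wrap Futures and can suspend/resume.
type Task = Box<dyn FnOnce()>;

struct Scheduler {
    queue: VecDeque<Task>,
}

impl Scheduler {
    fn new() -> Self {
        Scheduler { queue: VecDeque::new() }
    }

    // A task that becomes runnable is inserted at the tail of the queue.
    fn spawn(&mut self, task: Task) {
        self.queue.push_back(task);
    }

    // The core scheduler loop: pop runnable tasks and run them to completion.
    fn run(&mut self) {
        while let Some(task) = self.queue.pop_front() {
            task();
        }
    }
}

fn main() {
    let mut scheduler = Scheduler::new();
    scheduler.spawn(Box::new(|| println!("task 1")));
    scheduler.spawn(Box::new(|| println!("task 2")));
    scheduler.run(); // runs tasks in FIFO order
}
```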

Single Queue + Multiple Workers

All workers share a global task queue. Producers push tasks to the tail, workers pop from the head. The queue uses an intrusive linked list to avoid allocations; pushes are lock‑free, pops use a semaphore for coordination.

Advantages: fair FIFO scheduling and a simple implementation.

However, FIFO fairness can hurt parallel‑join workloads, and contention on the head can become a bottleneck for short‑lived async tasks.
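The model can be sketched as follows. Tokio's actual implementation used a lock-free intrusive list plus a semaphore; a Mutex/Condvar pair keeps the sketch short, and all names here are illustrative. A None entry in the queue serves as a shutdown marker.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// "Single queue + multiple workers": every worker locks the same queue.
type Task = Box<dyn FnOnce() + Send>;

struct Shared {
    queue: Mutex<VecDeque<Option<Task>>>,
    cv: Condvar,
}

fn worker(shared: Arc<Shared>) {
    loop {
        let mut q = shared.queue.lock().unwrap();
        // Sleep until a task (or a shutdown marker) arrives.
        while q.is_empty() {
            q = shared.cv.wait(q).unwrap();
        }
        match q.pop_front().unwrap() {
            Some(task) => {
                drop(q); // release the lock before running the task
                task();
            }
            None => return, // shutdown marker
        }
    }
}

fn run_on_shared_queue(num_workers: usize, num_tasks: usize) -> usize {
    let shared = Arc::new(Shared {
        queue: Mutex::new(VecDeque::new()),
        cv: Condvar::new(),
    });
    let done = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..num_workers)
        .map(|_| {
            let s = shared.clone();
            thread::spawn(move || worker(s))
        })
        .collect();

    // Producers push to the tail; any worker may pop from the head.
    for _ in 0..num_tasks {
        let d = done.clone();
        shared.queue.lock().unwrap().push_back(Some(Box::new(move || {
            d.fetch_add(1, Ordering::SeqCst);
        })));
        shared.cv.notify_one();
    }
    // One shutdown marker per worker, after all real tasks.
    for _ in 0..num_workers {
        shared.queue.lock().unwrap().push_back(None);
        shared.cv.notify_one();
    }
    for h in handles {
        h.join().unwrap();
    }
    done.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(run_on_shared_queue(4, 100), 100);
}
```

Every pop contends on the same lock, which is exactly the head-contention bottleneck described above.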

Multiple Queues (One per Worker)

Each worker owns its own queue, eliminating most synchronization. To allow any thread to submit tasks, each queue must support thread‑safe insertion (MPSC) or have separate sync/async queues.

This model resembles Seastar’s strategy and offers high performance when load is balanced, but can suffer from load imbalance when tasks stick to particular workers.
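A minimal sketch of the per-worker MPSC variant, using std::sync::mpsc channels as the thread-safe queues (names like run are illustrative, not Seastar's or Tokio's API):

```rust
use std::sync::mpsc;
use std::thread;

// One MPSC queue per worker: any thread can send into any worker's queue,
// but only the owning worker receives, so the consume side needs no locks.
type Task = Box<dyn FnOnce() -> u32 + Send>;

fn run(num_workers: usize, n: u32) -> u32 {
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..num_workers {
        let (tx, rx) = mpsc::channel::<Task>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Each worker drains only its own queue: no stealing.
            rx.into_iter().map(|task| task()).sum::<u32>()
        }));
    }

    // Submit tasks round-robin; a skewed assignment here is exactly the
    // load-imbalance problem described above.
    for i in 0..n {
        senders[(i as usize) % num_workers]
            .send(Box::new(move || i))
            .unwrap();
    }
    drop(senders); // close the channels so workers exit

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(run(2, 10), 45); // 0 + 1 + … + 9
}
```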

Task‑Stealing Scheduler

Workers first consume their own queue; when empty they attempt to steal tasks from peers. Randomized victim selection reduces contention, and a throttle‑stealing algorithm limits the number of concurrent stealers to roughly half the worker count.
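Two pieces of that strategy can be sketched in isolation (illustrative code, not Tokio's implementation): a steal operation that takes half of a victim's queue, and a throttle that refuses to let a worker start searching once half the workers are already doing so.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};

// Move half (rounded up) of the victim's tasks to the thief's queue.
fn steal_half<T>(victim: &mut VecDeque<T>, thief: &mut VecDeque<T>) -> usize {
    let n = (victim.len() + 1) / 2;
    for _ in 0..n {
        if let Some(task) = victim.pop_front() {
            thief.push_back(task);
        }
    }
    n
}

// Throttle: a worker may enter the "searching" state only while fewer
// than half of all workers are already searching.
fn try_start_searching(num_searching: &AtomicUsize, num_workers: usize) -> bool {
    let mut current = num_searching.load(Ordering::SeqCst);
    loop {
        if 2 * current >= num_workers {
            return false; // enough stealers already; stay parked
        }
        match num_searching.compare_exchange(
            current,
            current + 1,
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            Ok(_) => return true,
            Err(actual) => current = actual, // raced with another worker; retry
        }
    }
}

fn main() {
    let mut victim: VecDeque<u32> = (0..4).collect();
    let mut thief = VecDeque::new();
    steal_half(&mut victim, &mut thief);
    assert_eq!(thief.len(), 2);
    assert_eq!(victim.len(), 2);

    let searching = AtomicUsize::new(0);
    assert!(try_start_searching(&searching, 4));  // 0 of 4 searching: allowed
    assert!(try_start_searching(&searching, 4));  // 1 of 4 searching: allowed
    assert!(!try_start_searching(&searching, 4)); // 2 of 4: throttled
}
```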

Next‑Generation Tokio Scheduler

New Task System

The standard library now provides the std::future task system, designed by Taylor Cramer. It introduces a compact Waker that is just two pointers wide, allowing the scheduler to push tasks with minimal copying and better cache utilization.
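The "two pointers" claim can be checked directly: a Waker wraps a RawWaker, which is one data pointer plus one vtable pointer. A minimal no-op waker (a real scheduler's vtable entries would re-enqueue the task):

```rust
use std::mem;
use std::ptr;
use std::task::{RawWaker, RawWakerVTable, Waker};

// Build a waker whose vtable entries do nothing; the shape is what matters.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(ptr::null(), &VTABLE)
}

fn main() {
    // Two machine words: a data pointer plus a vtable pointer.
    assert_eq!(mem::size_of::<Waker>(), 2 * mem::size_of::<usize>());

    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    waker.wake_by_ref(); // no-op here; in Tokio this schedules the task
}
```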

Improved Task Queue

The old scheduler used Crossbeam’s unbounded deque, which required atomic RMW operations and occasional reallocations. The new scheduler fixes the local queue size and pushes overflow tasks to a global MPMC queue protected by a mutex, dramatically reducing contention.

struct Queue {
    /// Concurrently updated by many threads.
    head: AtomicU32,
    /// Only updated by the producer thread, but read by many threads.
    tail: AtomicU32,
    /// Masks the head / tail position value to obtain the index in the buffer.
    mask: usize,
    /// Stores the tasks.
    buffer: Box<[MaybeUninit<Task>]>,
}

Enqueue operation (single‑producer thread):

// Inside push(&mut self, mut task: Task, global: &GlobalQueue):
loop {
    let head = self.head.load(Acquire);
    // Safety: only this (producer) thread ever updates `tail`.
    let tail = self.tail.unsync_load();

    if tail.wrapping_sub(head) < self.buffer.len() as u32 {
        // Map the position to a slot index.
        let idx = tail as usize & self.mask;
        self.buffer[idx].as_mut_ptr().write(task);
        // Release store makes the task visible to consumers.
        self.tail.store(tail.wrapping_add(1), Release);
        return;
    }

    // The local buffer is full: push a batch of tasks to the global queue.
    match self.push_overflow(task, head, tail, global) {
        Ok(_) => return,
        // The overflow failed; the task is handed back and we retry.
        Err(v) => task = v,
    }
}

Dequeue operation (single‑consumer thread):

// Inside pop(&mut self) -> Option<Task>:
loop {
    let head = self.head.load(Acquire);
    // Safety: only this (owner) thread ever updates `tail`.
    let tail = self.tail.unsync_load();

    if head == tail {
        return None; // queue is empty
    }

    // Map the head position to a slot index.
    let idx = head as usize & self.mask;
    let task = self.buffer[idx].as_ptr().read();

    // Attempt to claim the task read above; a concurrent stealer may win.
    let actual = self.head.compare_and_swap(head, head.wrapping_add(1), Release);

    if actual == head {
        return Some(task.assume_init());
    }
}
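The wrapping arithmetic and the mask are easier to see in a safe, single-threaded miniature of the same ring buffer. Capacity must be a power of two so that `position & mask` maps an ever-increasing position onto a slot; this sketch uses Option slots instead of MaybeUninit and drops all the atomics.

```rust
// Safe, single-threaded miniature of the ring buffer above.
struct Ring<T> {
    head: u32,
    tail: u32,
    mask: usize, // capacity - 1
    buffer: Vec<Option<T>>,
}

impl<T> Ring<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Ring {
            head: 0,
            tail: 0,
            mask: capacity - 1,
            buffer: (0..capacity).map(|_| None).collect(),
        }
    }

    fn push(&mut self, value: T) -> Result<(), T> {
        // Full when tail has advanced a whole capacity past head.
        if self.tail.wrapping_sub(self.head) as usize == self.buffer.len() {
            return Err(value); // real scheduler: push_overflow to the global queue
        }
        let idx = self.tail as usize & self.mask;
        self.buffer[idx] = Some(value);
        self.tail = self.tail.wrapping_add(1);
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // empty
        }
        let idx = self.head as usize & self.mask;
        self.head = self.head.wrapping_add(1);
        self.buffer[idx].take()
    }
}

fn main() {
    let mut ring = Ring::new(4);
    for i in 0..4 {
        assert!(ring.push(i).is_ok());
    }
    assert!(ring.push(99).is_err()); // full: would overflow to the global queue
    assert_eq!(ring.pop(), Some(0)); // FIFO order
}
```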

Optimized Message‑Passing Pattern

When a task sends a message on a channel, the receiver becomes runnable. The new scheduler stores the newly runnable task in a "next task" slot instead of enqueuing it at the tail, ensuring immediate execution and reducing latency.
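The slot can be sketched like this (illustrative names, not Tokio's types): a freshly woken task takes the slot and runs before anything in the FIFO queue, so the message it was woken for is still hot in cache; a task already occupying the slot is demoted to the queue.

```rust
use std::collections::VecDeque;

struct LocalQueue<T> {
    next: Option<T>, // the "next task" slot, checked before the FIFO queue
    fifo: VecDeque<T>,
}

impl<T> LocalQueue<T> {
    fn new() -> Self {
        LocalQueue { next: None, fifo: VecDeque::new() }
    }

    fn schedule(&mut self, task: T) {
        // A task already in the slot moves to the back of the FIFO queue.
        if let Some(prev) = self.next.replace(task) {
            self.fifo.push_back(prev);
        }
    }

    fn next_task(&mut self) -> Option<T> {
        self.next.take().or_else(|| self.fifo.pop_front())
    }
}

fn main() {
    let mut q = LocalQueue::new();
    q.schedule("receiver A");
    q.schedule("receiver B"); // B takes the slot; A is demoted to the queue
    assert_eq!(q.next_task(), Some("receiver B")); // most recent wake runs first
    assert_eq!(q.next_task(), Some("receiver A"));
}
```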

Reducing Cross‑Thread Synchronization

Notifications are now throttled: a worker only notifies another when no other worker is already searching for work. This limits wake‑up storms and improves CPU utilization.
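A sketch of the throttle (Idle and its methods are illustrative names, not Tokio's internals): a worker submitting a task skips the wake-up entirely when another worker is already searching, because that searcher will find the task anyway.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Idle {
    num_searching: AtomicUsize,
}

impl Idle {
    fn new() -> Self {
        Idle { num_searching: AtomicUsize::new(0) }
    }

    // Called on task submission: wake a sleeping peer only if no worker
    // is already out looking for work.
    fn should_notify(&self) -> bool {
        self.num_searching.load(Ordering::SeqCst) == 0
    }

    fn start_searching(&self) {
        self.num_searching.fetch_add(1, Ordering::SeqCst);
    }

    fn stop_searching(&self) {
        self.num_searching.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    let idle = Idle::new();
    assert!(idle.should_notify()); // nobody searching: wake a peer

    idle.start_searching();
    assert!(!idle.should_notify()); // a searcher will pick the task up

    idle.stop_searching();
    assert!(idle.should_notify());
}
```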

Memory‑Allocation Reductions

The new task struct eliminates an extra heap allocation by storing the future directly in the task structure, separating hot (header) and cold (trailer) data to keep the frequently accessed part cache-friendly.

struct Task<T> {
    header: Header,
    future: T,
    trailer: Trailer,
}

Reducing Atomic Reference Counting

The scheduler now prefers wake(self) over wake_by_ref, avoiding unnecessary Arc::clone operations. When possible, the scheduler maintains a list of active tasks to reuse references without extra atomics.
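The cost difference can be shown with plain Arc handles (TaskHandle and schedule are stand-ins, not Tokio's types): waking by value hands the caller's existing reference to the scheduler, while waking by reference must clone it first, paying an atomic increment and a later decrement.

```rust
use std::sync::Arc;

struct TaskHandle;

// Stand-in for pushing the task onto a run queue.
fn schedule(task: Arc<TaskHandle>) {
    drop(task);
}

fn wake(task: Arc<TaskHandle>) {
    schedule(task); // ownership moves: no reference-count traffic
}

fn wake_by_ref(task: &Arc<TaskHandle>) {
    schedule(task.clone()); // extra atomic increment + eventual decrement
}

fn main() {
    let task = Arc::new(TaskHandle);
    let peer = task.clone(); // a second handle, e.g. held by a channel

    wake_by_ref(&task); // count bumps to 3, then drops back to 2
    assert_eq!(Arc::strong_count(&task), 2);

    wake(peer); // consumes the clone: count drops to 1
    assert_eq!(Arc::strong_count(&task), 1);
}
```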

Using Loom for Fearless Concurrency Testing

Loom exhaustively explores all possible interleavings of concurrent operations, checking memory safety and correct allocation/deallocation. Example test:

#[test]
fn multi_spawn() {
    loom::model(|| {
        let pool = ThreadPool::new();
        let c1 = Arc::new(AtomicUsize::new(0));
        let (tx, rx) = oneshot::channel();
        let tx1 = Arc::new(Mutex::new(Some(tx)));
        // spawn tasks …
        rx.recv();
    });
}

Loom uncovered dozens of subtle bugs that traditional unit tests missed, giving the Tokio team confidence in the new scheduler.

Performance Results

Micro‑benchmarks show dramatic speedups (e.g., chained_spawn from ~2 ms to ~0.17 ms). Real‑world Hyper benchmark improves latency from 371 µs to 275 µs and throughput from 113 k req/s to 152 k req/s. Tonic also sees >10 % gains.

Conclusion

The rewritten Tokio scheduler demonstrates that careful reduction of synchronization, memory allocation, and queue contention can yield order‑of‑magnitude performance improvements, while Loom‑based testing ensures correctness of the highly concurrent code.

Tags: performance, concurrency, Rust, tokio, scheduler, async
Written by High Availability Architecture

Official account for High Availability Architecture.