
Understanding Coroutines: Principles, Implementations, and Performance in C/C++

This article explains the concept of coroutines as lightweight user‑level threads, compares them with traditional threads, details various implementation mechanisms in C/C++ (including libco and NtyCo), and demonstrates how they improve I/O‑bound server performance through examples and code snippets.

1. Introduction to Coroutines

Coroutines, also known as lightweight threads, micro‑threads, or fibers, are user‑level constructs that allow functions to be paused and resumed quickly without kernel involvement, making them ideal for I/O‑intensive tasks.

Unlike normal function calls that follow a strict call‑and‑return stack, a coroutine can suspend its execution at any point, switch to another coroutine, and later resume from the same point, similar to a CPU interrupt.

Example of Coroutine Switching

def A():
    print('1')
    print('2')
    print('3')

def B():
    print('x')
    print('y')
    print('z')

When executed as coroutines, the output may interleave as:

1
2
x
y
3
z

2. Advantages of Coroutines

Very high execution efficiency because context switches are controlled by the program, avoiding the overhead of kernel thread switches.

No need for locking mechanisms when only one thread runs multiple coroutines, eliminating contention on shared resources.

To utilize multiple CPU cores, combine coroutines with multiple processes.

3. Coroutine Support in Languages

Native support exists in C++20, Go, Python, etc. Other languages provide coroutine libraries (e.g., Tencent's fiber, libco).

4. C/C++ Coroutines

C++20 introduces native coroutine support, but compiler and library compatibility is still catching up. Most C/C++ coroutine libraries rely on two approaches: assembly‑level context switching or OS‑provided APIs.

Typical Libraries

libco, Boost.context – assembly based

phxrpc – based on ucontext/Boost.context

libmill – based on setjmp/longjmp

Low‑Level Implementation Mechanisms

Assembly‑based context switch (fastest)

Switch‑case state machines

OS APIs: Linux ucontext, Windows Fiber

setjmp/longjmp with static locals

Example of setjmp/longjmp

#include <stdio.h>
#include <setjmp.h>
jmp_buf buf;

void banana(){
    printf("in banana() \n");
    longjmp(buf,1);
    printf("you'll never see this");
}

int main(){
    if(setjmp(buf))
        printf("back in main\n");
    else {
        printf("first time through\n");
        banana();
    }
    return 0;
}

5. libco Coroutine Structure

libco represents a coroutine with stCoRoutine_t, which stores the execution environment, function pointer, arguments, stack information, and status flags.

struct stCoRoutine_t {
    stCoRoutineEnv_t *env;      // execution environment
    pfn_co_routine_t pfn;      // coroutine function
    void *arg;                // argument
    coctx_t ctx;              // saved context
    ...
    char cEnableSysHook;      // system hook flag
    char cIsShareStack;       // shared‑stack flag
    void *pvEnv;
    stStackMem_t* stack_mem; // stack memory
    char* stack_sp;           // stack pointer
    unsigned int save_size;
    char* save_buffer;
};

6. NtyCo – A Coroutine‑Based I/O Framework

NtyCo combines coroutine scheduling with asynchronous I/O. It provides APIs such as nty_coroutine_create, nty_coroutine_resume, nty_coroutine_yield, and POSIX-style socket wrappers (nty_socket, nty_accept, nty_recv, nty_send, nty_close).

Creating a Coroutine

int nty_coroutine_create(nty_coroutine **new_co, proc_coroutine func, void *arg) {
    nty_schedule *sched = ...;   // the calling thread's scheduler (thread-local in NtyCo)
    // allocate and initialise the coroutine structure
    nty_coroutine *co = calloc(1, sizeof(nty_coroutine));
    posix_memalign(&co->stack, getpagesize(), sched->stack_size); // page-aligned stack
    co->sched = sched;
    co->func = func;
    co->arg = arg;
    co->birth = nty_coroutine_usec_now();
    *new_co = co;
    TAILQ_INSERT_TAIL(&sched->ready, co, ready_next);  // enqueue as ready
    return 0;
}

Yield and Resume

void nty_coroutine_yield(nty_coroutine *co) {
    // save current context and switch back to scheduler
    co_swap(co, scheduler_current);
}

int nty_coroutine_resume(nty_coroutine *co) {
    // restore coroutine context and run until next yield
    co_swap(scheduler_current, co);
    return 0;
}

7. Scheduler Design

The scheduler maintains three collections: a ready queue (FIFO), a sleep tree (ordered by wake‑up time), and a wait tree (for I/O events). It repeatedly:

Moves expired sleep entries to the ready queue.

Processes epoll/kqueue events, moving ready I/O coroutines to the ready queue.

Resumes coroutines from the ready queue.

8. Epoll‑Based Event Loop

Coroutines use co_poll to register file descriptors with epoll. The function stores a pointer to a stPollItem_t in epoll_event.data.ptr so that when the event fires, the scheduler can retrieve the associated coroutine and resume it.

int co_poll(stCoEpoll_t *ctx, struct pollfd fds[], nfds_t nfds, int timeout_ms) {
    // register fds with epoll and associate each with its coroutine
    for (nfds_t i = 0; i < nfds; ++i) {
        struct epoll_event ev;
        ev.events = PollEvent2Epoll(fds[i].events); // map POLLIN/POLLOUT to EPOLLIN/EPOLLOUT
        ev.data.ptr = &poll_items[i];               // lets the event loop find the coroutine
        epoll_ctl(ctx->iEpollFd, EPOLL_CTL_ADD, fds[i].fd, &ev);
    }
    AddTimeout(...);            // optional timeout handling via the timing wheel
    co_yield_env(env);          // suspend the current coroutine
    // on wake-up, remove the fds from epoll and clean up
    return 0;
}

9. Timing‑Wheel Timer

For coroutine‑level timeouts, libco uses a 60‑second timing‑wheel. Each timeout item is placed into a bucket based on its expiration offset; the wheel advances once per second, moving expired items to the active list so their coroutines can be resumed.

10. Performance Evaluation

Tests on a 4‑core Ubuntu 14.04 server with 6 GB RAM and three client VMs showed that using coroutines with epoll yields ~900 ms response time for 1 000 concurrent connections, compared to ~6.5 s for a purely synchronous design, demonstrating the high throughput of coroutine‑based asynchronous I/O.

11. Practical Usage of libco

Typical usage steps:

Create a listening socket (non‑blocking).

Spawn a coroutine for each accepted connection using co_create and co_resume.

Inside the coroutine, perform reads/writes via co_poll to let the scheduler handle readiness.

Run the main event loop with co_eventloop, which repeatedly calls epoll_wait, processes timed-out coroutines, and resumes ready ones.

By keeping all I/O in a single thread of coroutines, developers obtain the simplicity of synchronous code while achieving the scalability of asynchronous, event‑driven servers.

Further Reading

Deep Dive into C++ Memory Management

Linux Kernel’s New Maple Tree

Inside Linux Kernel Architecture

Tags: performance, C++, epoll, coroutine, asynchronous I/O, libco, NtyCo
Written by

Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
