How to Build a C++ Stackful Coroutine from Scratch: Deep Dive into Context Switching
This article explains the low‑level principles of C++ coroutine context switching, walks through the owl.context API design, demonstrates assembly implementations for co_getcontext, co_setcontext, co_swapcontext and co_makecontext, and provides practical examples and code to build stackful coroutines on 32‑bit ARM.
Introduction
In the previous article "Design and Implementation of WeChat's Self‑Developed C++ Coroutine Framework" we introduced the evolution of asynchronous programming and the overall design of the owl coroutine, but we did not dive into the concrete implementation details. Implementing a stackful coroutine in C++ hinges on context switching; in the owl architecture, owl.context sits at the lowest layer, and all upper‑level APIs are built on top of it.
This article details the low‑level principles of C++ coroutine context switching and guides you step‑by‑step to implement a C++ coroutine from zero.
owl.context Interface Design
Well-known context-switching libraries in the industry include ucontext and boost.context. The ucontext API is complete and its semantics are clear, while boost.context's API is more obscure. To keep the code easy to understand, owl.context initially aimed to be compatible with the ucontext interface, but some ucontext design choices no longer make sense today and would add unnecessary complexity. The final interface therefore keeps ucontext's semantics but refines several details.
owl.context provides four APIs; we first list the interface definitions and then explain each API’s implementation.
typedef struct {
    void* base;
    size_t size;
} co_stack_t;
typedef struct co_context {
    co_reg_t regs[32];
    co_stack_t stack;
    struct co_context* link;
} co_context_t;
// Get current context, returns 0 on normal return, 1 when returning via co_setcontext
int co_getcontext(co_context_t* ctx);
// Switch to the specified context
void co_setcontext(const co_context_t* ctx);
// Save current context to octx, then switch to ctx
void co_swapcontext(co_context_t* octx, const co_context_t* ctx);
// Create a new context on a given stack and set up the execution environment for fn
void co_makecontext(co_context_t* ctx, void (*fn)(uintptr_t), uintptr_t arg);
Context Switch Example
Before explaining the implementation of the above APIs, we first illustrate the basic concept of context switching with a simple example.
#include <stdio.h>
#include <unistd.h>
void test() {
    printf("start\n");
    volatile int n = 3;
    co_context_t ctx;
    int ret = co_getcontext(&ctx);
    if (n > 0) {
        printf("ret = %d, n = %d\n", ret, n);
        sleep(1);
        --n;
        co_setcontext(&ctx);
    }
    printf("end\n");
}
Running result:
start
ret = 0, n = 3
ret = 1, n = 2
ret = 1, n = 1
end
From the result we can see that co_getcontext and co_setcontext act like an enhanced goto, allowing execution flow to jump between stack frames within the same stack. The call to co_getcontext saves the current context into ctx; when co_setcontext(&ctx) is executed, the flow jumps back to the line after co_getcontext, making the function appear to return again, this time with a return value of 1.
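The same "return twice" behaviour can be observed on any platform with the C standard library's setjmp/longjmp, which also jump within a single stack. The following is a stand-in sketch, not owl.context: count_reentries is a hypothetical helper, setjmp plays the role of co_getcontext, and longjmp plays the role of co_setcontext.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

/* Jumps back to the setjmp() point until n reaches 0.
 * Returns how many times setjmp() returned because of longjmp(). */
int count_reentries(int start) {
    assert(start >= 0);
    jmp_buf env;
    volatile int n = start;        /* volatile: value must survive longjmp */
    volatile int reentries = 0;

    if (setjmp(env) != 0) {        /* 0 on the direct call, non-zero after longjmp */
        ++reentries;
    }
    if (n > 0) {
        printf("reentries = %d, n = %d\n", reentries, n);
        --n;
        longjmp(env, 1);           /* jump back; setjmp() now returns 1 */
    }
    return reentries;
}
```

Calling count_reentries(3) goes through the if (n > 0) body three times, just as the test() function above calls co_setcontext three times before printing "end".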
Context Switch Principles
To implement context switching, we must first understand the concept of a thread’s context. A running thread’s context consists of two parts:
CPU register values
Thread‑private data
Only a few platforms (e.g., Win32) have thread‑private data; for most mainstream operating systems the context is essentially the CPU registers. Therefore, implementing a context switch only requires saving and restoring registers.
Which registers need to be saved? This depends on the calling convention. Using the 32‑bit ARM AAPCS (Procedure Call Standard for the ARM Architecture) as an example, the convention defines:
16 integer registers r0‑r15 and 32 floating‑point registers s0‑s31.
r0‑r3 are used for arguments, r0‑r1 for return values.
r4-r8, r10, r11, s16-s31 are callee-saved registers.
r9 is platform‑specific and may be treated as callee‑saved.
r11‑r15 are special registers (r11=FP, r12=IP, r13=SP, r14=LR, r15=PC).
Thus co_getcontext must save the callee‑saved registers, r9, the stack pointer (SP) and the link register (LR). co_setcontext restores them.
Because each CPU architecture has its own instruction set and calling convention, and even different operating systems on the same architecture may differ, all APIs in this article are implemented for the 32‑bit ARM architecture.
co_getcontext Implementation
Based on the analysis above, implementing co_getcontext is straightforward: we save registers r4‑r11, SP, LR, and the floating‑point registers s16‑s31 into ctx->regs.
.globl co_getcontext
co_getcontext:
    /* save r4-r11, lr, sp to regs[0-9] */
    mov r1, sp
    stmia r0!, { r4-r11, lr }
    stmia r0!, { r1 }
    /* save s16-s31 to regs[16-31] */
    add r0, r0, #24
    vstmia r0, { s16-s31 }
    /* return 0 */
    mov r0, #0
    mov pc, lr
Memory layout of ctx->regs: regs[0-7] hold r4-r11, regs[8] holds lr, regs[9] holds sp, regs[10-15] are unused for now, and regs[16-31] hold s16-s31.
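The layout follows from the assembly: regs[0-9] take r4-r11, lr and sp, and regs[16-31] take s16-s31. The byte offsets, and the #24 skip between the two stores, can be checked with a small C model. This is a sketch that assumes co_reg_t is a 32-bit word; the enum names and both helper functions are hypothetical and not part of owl.context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one slot per 32-bit ARM register, so co_reg_t is 4 bytes. */
typedef uint32_t co_reg_t;

/* Hypothetical slot names; the indices follow the assembly comments. */
enum {
    CO_SLOT_R4  = 0,   /* regs[0..7]   = r4-r11  */
    CO_SLOT_LR  = 8,   /* regs[8]      = lr      */
    CO_SLOT_SP  = 9,   /* regs[9]      = sp      */
    CO_SLOT_S16 = 16   /* regs[16..31] = s16-s31 */
};

/* Byte offset of regs[slot] from the start of the register array. */
size_t co_slot_offset(int slot) {
    assert(slot >= 0 && slot < 32);
    return (size_t)slot * sizeof(co_reg_t);
}

/* Bytes to skip after storing `stored_words` words with write-back
 * so that the pointer lands on the first floating-point slot. */
size_t co_fp_skip_bytes(int stored_words) {
    assert(stored_words >= 0 && stored_words <= CO_SLOT_S16);
    return co_slot_offset(CO_SLOT_S16) - co_slot_offset(stored_words);
}
```

With ten words stored (r4-r11, lr, sp), co_fp_skip_bytes(10) evaluates to 24, which matches the add r0, r0, #24 instruction.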
co_setcontext Implementation
co_setcontext mirrors co_getcontext: it restores the saved registers.
.globl co_setcontext
co_setcontext:
    /* load r4-r11, lr, sp from regs[0-9] */
    ldmia r0!, { r4-r11, lr }
    ldmia r0!, { r1 }
    mov sp, r1
    /* load s16-s31 from regs[16-31] */
    add r0, r0, #24
    vldmia r0, { s16-s31 }
    /* make co_getcontext() return 1 */
    mov r0, #1
    mov pc, lr
The subtle point is the last two instructions. After the restore, lr holds the return address that co_getcontext saved, so mov pc, lr jumps back into the caller of co_getcontext. That earlier call, which returned 0 the first time, therefore appears to return again, this time with 1 in r0.
co_swapcontext Implementation
co_swapcontext essentially calls co_getcontext first and then co_setcontext. It can be expressed in C as:
void co_swapcontext(co_context_t* octx, const co_context_t* ctx) {
    if (co_getcontext(octx) == 0) {
        co_setcontext(ctx);
    }
}
Note: In the glibc implementation of ucontext, swapcontext() is not a simple wrapper; it re-implements the save/restore logic in assembly. owl.context reuses co_getcontext and co_setcontext, greatly reducing assembly code size.
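For comparison, the same get-then-set pattern can be tried with POSIX ucontext on Linux. getcontext() returns 0 both on the direct call and when the saved context is later resumed, so the sketch needs a volatile flag where co_getcontext relies on its return value of 1. The names swap_via_get_set, yield_body and run_swap_demo are hypothetical, and makecontext() here is the ucontext call, not owl.context.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t caller_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static int order[4];
static int order_len;

/* swapcontext() rebuilt from getcontext() + setcontext(). */
static void swap_via_get_set(ucontext_t *octx, const ucontext_t *ctx) {
    volatile int resumed = 0;   /* lives in this frame, survives the switch */
    getcontext(octx);           /* execution also restarts here on resume */
    if (!resumed) {
        resumed = 1;
        setcontext(ctx);        /* does not return on success */
    }
}

static void yield_body(void) {
    order[order_len++] = 1;
    swap_via_get_set(&worker_ctx, &caller_ctx);   /* yield to the caller */
    order[order_len++] = 3;
}   /* returning resumes the context given in uc_link */

/* Runs the ping-pong once and returns the recorded steps as digits. */
int run_swap_demo(void) {
    order_len = 0;
    getcontext(&worker_ctx);                      /* initialise before makecontext() */
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link = &caller_ctx;
    makecontext(&worker_ctx, yield_body, 0);

    swap_via_get_set(&caller_ctx, &worker_ctx);   /* start yield_body */
    order[order_len++] = 2;
    swap_via_get_set(&caller_ctx, &worker_ctx);   /* resume yield_body */
    order[order_len++] = 4;
    return order[0] * 1000 + order[1] * 100 + order[2] * 10 + order[3];
}
```

The flag is what a return value of 1 replaces in co_getcontext: both tell the resumed code that it should not switch away a second time.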
co_makecontext Example
Using only co_getcontext and co_setcontext allows jumps within the same call stack, which is of limited practical use. To create a stackful coroutine each coroutine needs its own stack; co_makecontext creates a new execution environment on a specified stack.
co_context_t ctx0;
co_context_t ctx1;
void co_hello(uintptr_t arg) {
    /* cast: uintptr_t is not unsigned long on every 32-bit target */
    printf("co_hello() Enter arg = %lu\n", (unsigned long)arg);
    co_swapcontext(&ctx1, &ctx0);
    printf("co_hello() Exit\n");
}
void test_make_context() {
    printf("main start\n");
    char stack[4096];
    // 1. set stack
    ctx1.stack.base = stack;
    ctx1.stack.size = sizeof(stack);
    // 2. set link for when co_hello returns
    ctx1.link = &ctx0;
    // 3. create execution environment for co_hello
    co_makecontext(&ctx1, &co_hello, 100);
    printf("main start co_hello\n");
    co_swapcontext(&ctx0, &ctx1);
    printf("main resume co_hello\n");
    co_swapcontext(&ctx0, &ctx1);
    printf("main end\n");
}
Running result:
main start
main start co_hello
co_hello() Enter arg = 100
main resume co_hello
co_hello() Exit
main end
co_makecontext creates an execution environment by specifying stack address, stack size, entry function, and argument, similar to pthread_create, except that it does not spawn a new thread.
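Readers without a 32-bit ARM target can reproduce the same control flow with POSIX ucontext on Linux. The sketch below mirrors the example above but is not owl.context: uc_hello, note and run_ucontext_demo are hypothetical names, the steps are recorded in a string instead of being printed, and the stack is a static 64 KiB buffer chosen for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

static ucontext_t uctx0, uctx1;
static char ustack[64 * 1024];   /* separate stack for the coroutine */
static char events[128];

/* Append one step to the event log. */
static void note(const char *s) {
    assert(strlen(events) + strlen(s) + 2 <= sizeof(events));
    strcat(events, s);
    strcat(events, ";");
}

static void uc_hello(int arg) {
    char buf[32];
    snprintf(buf, sizeof(buf), "enter %d", arg);
    note(buf);
    swapcontext(&uctx1, &uctx0);   /* yield back to the caller */
    note("exit");
}   /* returning switches to the context given in uc_link */

const char *run_ucontext_demo(void) {
    events[0] = '\0';
    getcontext(&uctx1);                        /* required before makecontext() */
    uctx1.uc_stack.ss_sp = ustack;             /* 1. set stack */
    uctx1.uc_stack.ss_size = sizeof(ustack);
    uctx1.uc_link = &uctx0;                    /* 2. set link */
    makecontext(&uctx1, (void (*)(void))uc_hello, 1, 100);   /* 3. entry + arg */

    note("start");
    swapcontext(&uctx0, &uctx1);   /* run uc_hello until it yields */
    note("resume");
    swapcontext(&uctx0, &uctx1);   /* resume uc_hello until it returns */
    note("end");
    return events;
}
```

The recorded order is start, enter 100, resume, exit, end, which is the same interleaving as the running result above.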
co_makecontext Implementation
To implement co_makecontext we need to understand the AAPCS calling convention, which defines the responsibilities of the caller and callee. The caller must:
Set up function arguments: up to four arguments go in r0‑r3; additional arguments are pushed onto the stack from right to left.
Ensure the stack is 8‑byte aligned (SP % 8 == 0) after pushing arguments.
Because the entry function for owl.context has the prototype void (uintptr_t), it takes a single argument that can be placed directly in r0.
Note: The ucontext makecontext prototype is void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...). It supports multiple integer arguments, requiring stack pushes for arguments beyond the register limit, which makes its implementation more complex. owl.context limits the entry function to a single argument, simplifying the design.
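The 8-byte alignment requirement amounts to rounding the top of the stack down to a multiple of 8, because the stack grows toward lower addresses. A minimal sketch, with co_align_down and co_initial_sp as hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr down to a multiple of align, which must be a power of two. */
uintptr_t co_align_down(uintptr_t addr, uintptr_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Initial stack pointer for a downward-growing stack [base, base + size):
 * start at the top and align down so that SP % 8 == 0. */
uintptr_t co_initial_sp(uintptr_t base, size_t size) {
    return co_align_down(base + size, 8);
}
```

Rounding down rather than up keeps the stack pointer inside the buffer even when base + size is not itself aligned.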
Implementation steps:
Store the entry function’s argument in r0.
Ensure the stack is 8‑byte aligned.
Set the link register (lr) to a stub function that jumps to the link context after the entry function returns.
The stub function co_jump_to_link is implemented in assembly:
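.globl co_jump_to_link
co_jump_to_link:
    /* when fn(arg) returns, call co_setcontext(link) */
    movs r0, r4
    bne co_setcontext
    b exit
Memory layout of the updated ctx->regs (now including fn and arg): the layout is as before, with fn stored in regs[10] and arg in regs[11]. For a context created by co_makecontext, the r4 slot regs[0] carries link.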
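Implementation of co_makecontext simply sets the registers r4 (link), lr (stub), sp, fn, and arg. link travels in r4 because r4 is callee-saved: fn must preserve it, so the stub still finds link there when fn returns. The other callee-saved registers (r5-r11, s16-s31) are irrelevant for a fresh stack.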
#define R4 0
#define LR 8
#define SP 9
#define FN 10
#define ARG 11
/* assembly stub defined above */
extern void co_jump_to_link(void);
void co_makecontext(co_context_t* ctx, void (*fn)(uintptr_t), uintptr_t arg) {
    uintptr_t stack_top = (uintptr_t)ctx->stack.base + ctx->stack.size;
    /* ensure the stack is 8-byte aligned */
    uintptr_t* sp = (uintptr_t*)(stack_top & ~(uintptr_t)7);
    ctx->regs[R4] = (uintptr_t)ctx->link;
    ctx->regs[LR] = (uintptr_t)&co_jump_to_link;
    ctx->regs[SP] = (uintptr_t)sp;
    ctx->regs[FN] = (uintptr_t)fn;
    ctx->regs[ARG] = arg;
}
Final Adjustments
To support co_makecontext, we modify the implementations of co_getcontext and co_setcontext:
In co_getcontext, set fn and arg to zero.
In co_setcontext, restore the registers as before; then, if fn is non-zero, jump to fn with arg in r0 (lr already points at co_jump_to_link), otherwise return 1 through the restored lr as before.
.globl co_getcontext
co_getcontext:
    /* r1 = sp, r2 = fn, r3 = arg */
    mov r1, sp
    mov r2, #0
    mov r3, #0
    stmia r0!, { r4-r11, lr }
    stmia r0!, { r1-r3 }
    /* skip regs[12-15] */
    add r0, r0, #16
    vstmia r0, { s16-s31 }
    mov r0, #0
    mov pc, lr
.globl co_setcontext
co_setcontext:
    /* r1 = sp, r2 = fn, r3 = arg */
    ldmia r0!, { r4-r11, lr }
    ldmia r0!, { r1-r3 }
    mov sp, r1
    /* skip regs[12-15] */
    add r0, r0, #16
    vldmia r0, { s16-s31 }
    /* fn != 0: first entry into a context built by co_makecontext */
    cmp r2, #0
    bne .cofunc
    /* fn == 0: make co_getcontext() return 1 */
    mov r0, #1
    mov pc, lr
.cofunc:
    /* call fn(arg); lr holds co_jump_to_link */
    mov r0, r3
    mov pc, r2
Conclusion
We have dissected the implementation of owl.context on the 32-bit ARM architecture. Extending it to other architectures follows the same pattern: understand the target CPU's instruction set and calling convention, then save and restore the appropriate registers. In practice, many platform-specific pitfalls arise, such as handling C++ exceptions on Win32, special treatment of FS/GS registers on Windows, differences between the Microsoft x64 and System V AMD64 calling conventions, ARM/THUMB compatibility, and watchOS's arm64_32 quirks. This article stops here due to space constraints, but future posts will explore those challenges in depth.
WeChat Client Technology Team
Official account of the WeChat mobile client development team, sharing development experience, cutting‑edge tech, and little‑known stories across Android, iOS, macOS, Windows Phone, and Windows.
