How PEP 659 Boosts CPython Performance: Bytecode Specialization Explained
This article examines PEP 659, the Specializing Adaptive Interpreter proposed for CPython, explains its non‑JIT approach, details the warm‑up, adaptive, specializing and de‑optimisation phases, and walks through the actual source‑code implementation using the LOAD_GLOBAL instruction as a concrete example.
Background
After joining Microsoft in late 2020, Guido van Rossum resumed work on CPython, and in 2021 the Faster CPython project was launched with the aim of making CPython five times faster within four years. The project is open source on GitHub under the faster-cpython organization, and several ideas already have prototype implementations.
PEP 659 Overview
PEP 659, created in April 2021, is titled "Specializing Adaptive Interpreter". It introduces two key concepts: specializing, which replaces generic bytecode with specialized versions for frequently executed paths, and adaptive, which observes runtime behavior to decide which specialization to apply.
The proposal is deliberately not a JIT compiler; it targets environments where JITs cannot be used (e.g., iOS where code pages must be signed). Instead, it performs interpreter‑level optimisations that can still yield performance gains comparable to or exceeding naive JITs.
In the PEP's own words: "Specialization is typically done in the context of a JIT compiler, but research shows specialization in an interpreter can boost performance significantly, even outperforming a naive compiler."
Optimization Granularity
Optimisations are applied at the granularity of individual virtual‑machine instructions. For example, the LOAD_GLOBAL opcode can be split into two specialised forms: LOAD_GLOBAL_MODULE and LOAD_GLOBAL_BUILTIN, each caching the dictionary entry index to avoid repeated hash‑table lookups. An additional adaptive opcode LOAD_GLOBAL_ADAPTIVE performs the observation and replacement logic.
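To make the caching idea concrete, here is a minimal sketch in C. It is not CPython code: ToyDict, ToyCache and the toy_* functions are hypothetical names invented for illustration, and the "dictionary" is a flat array rather than a hash table. It only shows the general pattern of remembering an entry index together with a version number, then checking that version before trusting the index.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy "dictionary": a flat array of entries plus a version
   number that is bumped whenever the set of keys changes. */
typedef struct {
    const char *key;
    int value;
} ToyEntry;

typedef struct {
    ToyEntry entries[8];
    size_t used;
    unsigned version;
} ToyDict;

/* Hypothetical per-instruction cache: what a specialised load remembers. */
typedef struct {
    unsigned version;  /* dictionary version seen when the cache was filled */
    size_t index;      /* position of the entry at that time */
} ToyCache;

/* Generic path: search by name on every call. Returns the index, or -1. */
static long toy_lookup(const ToyDict *d, const char *name) {
    for (size_t i = 0; i < d->used; i++) {
        if (strcmp(d->entries[i].key, name) == 0) {
            return (long)i;
        }
    }
    return -1;
}

/* Add a key and bump the version. Returns 0 on success, -1 if full. */
static int toy_insert(ToyDict *d, const char *key, int value) {
    if (d->used == sizeof d->entries / sizeof d->entries[0]) {
        return -1;
    }
    d->entries[d->used].key = key;
    d->entries[d->used].value = value;
    d->used++;
    d->version++;
    return 0;
}

/* "Specialise": do the slow lookup once and record where the entry lives. */
static int toy_specialize(const ToyDict *d, const char *name, ToyCache *c) {
    long i = toy_lookup(d, name);
    if (i < 0) {
        return -1;
    }
    c->version = d->version;
    c->index = (size_t)i;
    return 0;
}

/* Specialised path: one version check, then a direct array access.
   Returns 1 on a hit (with *out set) and 0 when the guard fails. */
static int toy_load_cached(const ToyDict *d, const ToyCache *c, int *out) {
    if (d->version != c->version) {
        return 0;  /* the caller should fall back to the generic path */
    }
    *out = d->entries[c->index].value;
    return 1;
}
```

In this sketch, changing a value in place leaves the version untouched, so the cached index stays valid and the fast path still reads the current value; only a change to the key layout invalidates the cache.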
Warm‑up Phase
Each PyCodeObject gains a new field, co_warmup, that tracks how often the code has run. It starts at a negative value (QUICKENING_INITIAL_WARMUP_VALUE, which is -8 in the listing below) and counts up; when it reaches zero, the bytecode is considered hot and optimisation begins.
/* Bytecode object */
struct PyCodeObject {
    PyObject_HEAD
    PyObject *co_consts;  /* list (constants used) */
    PyObject *co_names;   /* list of strings (names used) */
    int co_warmup;        /* Warmup counter for quickening */
    union _cache_or_instruction *co_quickened;
};

#define QUICKENING_WARMUP_DELAY 8
#define QUICKENING_INITIAL_WARMUP_VALUE (-QUICKENING_WARMUP_DELAY)

static void init_code(PyCodeObject *co, struct _PyCodeConstructor *con) {
    // ...
    co->co_warmup = QUICKENING_INITIAL_WARMUP_VALUE;
    co->co_quickened = NULL;
}
Adaptive Phase
When co_warmup reaches zero, _Py_Quicken copies the original bytecode (co_code) into a new array, co_quickened, and repoints co_firstinstr at the copy. All subsequent modifications happen on this copy, so the original bytecode stays intact.
int _Py_Quicken(PyCodeObject *code) {
    if (code->co_quickened) return 0;
    Py_ssize_t size = PyBytes_GET_SIZE(code->co_code);
    int instr_count = (int)(size/sizeof(_Py_CODEUNIT));
    if (instr_count > MAX_SIZE_TO_QUICKEN) {
        code->co_warmup = QUICKENING_WARMUP_COLDEST;
        return 0;
    }
    int entry_count = entries_needed(code->co_firstinstr, instr_count);
    SpecializedCacheOrInstruction *quickened = allocate(entry_count, instr_count);
    if (quickened == NULL) return -1;
    _Py_CODEUNIT *new_instructions = first_instruction(quickened);
    memcpy(new_instructions, code->co_firstinstr, size);
    optimize(quickened, instr_count);
    code->co_quickened = quickened;
    code->co_firstinstr = new_instructions;
    return 0;
}
The quickened array stores both the specialised instructions and their associated cache entries. Each cache entry holds data such as dictionary version numbers, key indices, and a counter to avoid thrashing.
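The interplay between the warm-up counter and the copy step can be sketched in a few lines of standalone C. This is a simplified model rather than the CPython implementation: ToyCode, TOY_WARMUP_DELAY and the toy_* functions are hypothetical, the "bytecode" is a plain byte array, and no cache entries are allocated alongside the copy.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_WARMUP_DELAY 8  /* mirrors QUICKENING_WARMUP_DELAY in the listing */

/* Hypothetical, heavily reduced stand-in for a code object. */
typedef struct {
    const unsigned char *original;  /* original instructions, never modified */
    size_t size;                    /* size of the instruction array in bytes */
    int warmup;                     /* starts negative and counts up to zero */
    unsigned char *quickened;       /* NULL until the code is considered hot */
} ToyCode;

static void toy_code_init(ToyCode *co, const unsigned char *code, size_t size) {
    co->original = code;
    co->size = size;
    co->warmup = -TOY_WARMUP_DELAY;
    co->quickened = NULL;
}

/* Make a private, writable copy of the instructions.
   Returns 0 on success (or if already done), -1 if allocation fails. */
static int toy_quicken(ToyCode *co) {
    if (co->quickened != NULL) {
        return 0;
    }
    unsigned char *copy = malloc(co->size);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, co->original, co->size);
    co->quickened = copy;
    return 0;
}

/* Called once per execution; returns the instruction array to run. */
static const unsigned char *toy_enter(ToyCode *co) {
    if (co->warmup < 0 && ++co->warmup == 0) {
        /* The code just became hot. If the copy cannot be allocated,
           this sketch simply keeps running the original instructions. */
        (void)toy_quicken(co);
    }
    return co->quickened != NULL ? co->quickened : co->original;
}
```

With a delay of 8, the first seven calls to toy_enter return the original array and the eighth switches to the private copy, which later rewrites can modify freely without touching the original.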
Specializing & De‑optimisation
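After warm-up and adaptive observation, hot instructions are replaced with specialised versions (e.g., LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN). Each specialised instruction guards its cached assumptions; if a guard fails (a dictionary keys version has changed, or the cached slot is empty), that execution falls through to the generic LOAD_GLOBAL and the cache entry's counter is decremented. Once the counter reaches zero, the instruction is rewritten to LOAD_GLOBAL_ADAPTIVE and the counter is reset to ADAPTIVE_CACHE_BACKOFF (64 in the listing below).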
TARGET(LOAD_GLOBAL_BUILTIN) {
    PyDictObject *mdict = (PyDictObject *)GLOBALS();
    PyDictObject *bdict = (PyDictObject *)BUILTINS();
    SpecializedCacheEntry *caches = GET_CACHE();
    _PyAdaptiveEntry *cache0 = &caches[0].adaptive;
    _PyLoadGlobalCache *cache1 = &caches[-1].load_global;
    DEOPT_IF(mdict->ma_keys->dk_version != cache1->module_keys_version, LOAD_GLOBAL);
    DEOPT_IF(bdict->ma_keys->dk_version != cache1->builtin_keys_version, LOAD_GLOBAL);
    PyDictKeyEntry *ep = DK_ENTRIES(bdict->ma_keys) + cache0->index;
    PyObject *res = ep->me_value;
    DEOPT_IF(res == NULL, LOAD_GLOBAL);
    STAT_INC(LOAD_GLOBAL, hit);
    Py_INCREF(res);
    PUSH(res);
    DISPATCH();
}

#define DEOPT_IF(cond, instname) if (cond) { goto instname##_miss; }
#define ADAPTIVE_CACHE_BACKOFF 64

static inline void cache_backoff(_PyAdaptiveEntry *entry) {
    entry->counter = ADAPTIVE_CACHE_BACKOFF;
}
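LOAD_GLOBAL_miss: {
    STAT_INC(LOAD_GLOBAL, miss);
    _PyAdaptiveEntry *cache = &GET_CACHE()->adaptive;
    cache->counter--;
    if (cache->counter == 0) {
        /* Too many misses: rewrite this instruction to its adaptive form. */
        next_instr[-1] = _Py_MAKECODEUNIT(LOAD_GLOBAL_ADAPTIVE, _Py_OPARG(next_instr[-1]));
        STAT_INC(LOAD_GLOBAL, deopt);
        cache_backoff(cache);
    }
    /* Either way, finish this execution on the generic path. */
    oparg = cache->original_oparg;
    STAT_DEC(LOAD_GLOBAL, unquickened);
    JUMP_TO_INSTRUCTION(LOAD_GLOBAL);
}
The de-optimisation logic means that a single cache miss only costs one pass through the generic LOAD_GLOBAL path. The instruction itself is rewritten back to its adaptive form only after repeated misses, and the back-off counter then delays the next specialisation attempt, which prevents excessive thrashing between specialised and unspecialised paths.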
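The counter handling in the miss handler can be modelled in isolation. The sketch below is a toy state machine, not interpreter code: ToyInstr, ToyOp and the toy_* functions are hypothetical, and toy_adaptive_step encodes an assumption that the listings here do not show, namely that the adaptive instruction counts the back-off value down and attempts to specialise again once it reaches zero.

```c
#define TOY_BACKOFF 64  /* mirrors ADAPTIVE_CACHE_BACKOFF in the listing */

enum ToyOp { TOY_ADAPTIVE, TOY_SPECIALIZED };

/* Hypothetical model of one instruction plus its adaptive cache entry. */
typedef struct {
    enum ToyOp op;
    int counter;  /* remaining misses (specialised) or back-off (adaptive) */
} ToyInstr;

/* Model of the miss handler: decrement the counter and, when it reaches
   zero, rewrite the instruction to its adaptive form with a back-off.
   Returns 1 if the instruction was de-optimised by this miss, else 0. */
static int toy_miss(ToyInstr *in) {
    in->counter--;
    if (in->counter == 0) {
        in->op = TOY_ADAPTIVE;
        in->counter = TOY_BACKOFF;
        return 1;
    }
    return 0;
}

/* ASSUMPTION (not shown in the article): the adaptive form waits out the
   back-off, then specialises again with a fresh miss budget.
   Returns 1 if the instruction was re-specialised by this step, else 0. */
static int toy_adaptive_step(ToyInstr *in, int miss_budget) {
    if (in->counter == 0) {
        in->op = TOY_SPECIALIZED;
        in->counter = miss_budget;
        return 1;
    }
    in->counter--;
    return 0;
}
```

Starting from a specialised instruction with a counter of 3, the first two calls to toy_miss leave it specialised and the third switches it to the adaptive form with a counter of 64. Under the stated assumption, the adaptive form then runs 64 times on the generic path before it tries to specialise again.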
Overall, PEP 659 demonstrates how a carefully designed adaptive and speculative specialisation framework can achieve substantial interpreter‑level speedups without relying on a full JIT, while still providing profiling information for future optimisations.
