How Python Implements Dictionaries: Inside the Underlying Structure
This article dissects Python's dictionary implementation, explaining the PyDictObject layout, the combined and split hash‑table designs, memory‑optimisation techniques, insertion and lookup processes, ordered iteration, and precise memory‑size calculations with concrete code examples.
CPython dictionary object layout
In CPython a dictionary is represented by the PyDictObject structure:
// Include/cpython/dictobject.h
typedef struct {
PyObject_HEAD
Py_ssize_t ma_used; // number of entries (dictionary length)
uint64_t ma_version_tag; // changes on any modification (used by iterators)
PyDictKeysObject *ma_keys; // pointer to the keys object
PyDictValues *ma_values; // pointer to the values object (NULL for combined tables)
} PyDictObject;The ma_keys pointer always points to a PyDictKeysObject. If the dictionary uses a combined table the ma_values pointer is NULL; if a split table is used both pointers are non‑NULL.
Early CPython used only combined tables. The split‑table design was added by PEP‑0412 to let multiple dictionaries share the same set of keys, reducing memory when many objects have identical attribute names.
from ctypes import *
class PyObject(Structure):
_fields_ = [("ob_refcnt", c_ssize_t), ("ob_type", c_void_p)]
class PyDictObject(PyObject):
_fields_ = [
("ma_used", c_ssize_t),
("ma_version_tag", c_uint64),
("ma_keys", c_void_p),
("ma_values", c_void_p),
]
d = {"a": 1, "b": 2}
print(PyDictObject.from_address(id(d)).ma_values) # None → combined table
class A: pass
a1 = A()
a2 = A()
print(PyDictObject.from_address(id(a1.__dict__)).ma_values,
PyDictObject.from_address(id(a2.__dict__)).ma_values) # non‑None → split table
print(PyDictObject.from_address(id(a1.__dict__)).ma_keys,
PyDictObject.from_address(id(a2.__dict__)).ma_keys) # identical keys pointerKey‑set object ( PyDictKeysObject )
The core of the hash table lives in PyDictKeysObject. Its definition (simplified) is:
// Include/cpython/dictobject.h
typedef enum {
DICT_KEYS_GENERAL = 0, // arbitrary key types
DICT_KEYS_UNICODE = 1, // all keys are Unicode strings
DICT_KEYS_SPLIT = 2 // split table (keys shared)
} DictKeysKind;
// Include/internal/pycore_dict.h
struct _dictkeysobject {
Py_ssize_t dk_refcnt; // reference count of the key set
uint8_t dk_log2_size; // log2 of the hash‑table size (power‑of‑2)
uint8_t dk_log2_index_bytes; // log2 of the index‑array byte size
uint8_t dk_kind; // one of DictKeysKind values
uint32_t dk_version; // changes when the key set changes
Py_ssize_t dk_usable; // how many entries can still be added
Py_ssize_t dk_nentries; // current number of entries
char dk_indices[]; // flexible array: hash‑index slots (store indices into dk_entries)
// dk_entries follows dk_indices in memory; each entry is either PyDictKeyEntry or PyDictUnicodeEntry
};Entry structures
Each entry stores a key, a value, and (for non‑string keys) the pre‑computed hash:
// Include/internal/pycore_dict.h
typedef struct {
Py_hash_t me_hash; // cached hash of the key
PyObject *me_key; // pointer to the key object
PyObject *me_value; // pointer to the value object
} PyDictKeyEntry;
typedef struct {
PyObject *me_key; // Unicode strings already carry a hash
PyObject *me_value; // pointer to the value object
} PyDictUnicodeEntry; // used from Python 3.12 when all keys are stringsInsertion algorithm (example key "komeiji")
Store a new PyDictKeyEntry in dk_entries at the next free index (0 for an empty dict).
Compute the hash of the key, map it to an index in the hash‑index array (e.g., 6).
Write the entry index (0) into dk_indices[6].
Lookup algorithm
Lookup repeats the hash computation, reads the index from dk_indices, fetches the entry from dk_entries, and compares the stored key with the query key. All steps are O(1), which explains the constant‑time performance of dictionary operations.
Iteration order (Python 3.6+)
Since entries are appended to dk_entries in insertion order, iterating simply walks this array. The order is an implementation detail and should not be relied upon for program logic.
Memory optimisation – splitting the hash table
Early CPython stored both the hash‑index and the entries in a single array, which required a 1/3 waste because the usable fraction of a hash table must stay below 2/3 to keep the load factor low. The macro used to compute the usable limit is:
// Objects/dictobject.c
#define USABLE_FRACTION(n) (((n) << 1) / 3)For a table of length 8, at most 5 entries can be stored (8 × 2/3 ≈ 5). CPython therefore allocates only 2/3 of the space for dk_entries and stores the mapping indices in a separate dk_indices array (1 byte per slot for capacities ≤ 255, 2 bytes for capacities ≤ 65535, etc.). This two‑array design saves memory because an entry occupies 24 bytes (generic) or 16 bytes (all‑string) while an index slot occupies only 1–2 bytes.
Memory size calculations
Sizes of the fixed parts (ignoring the flexible arrays):
typedef struct {
PyObject_HEAD // 16 bytes
Py_ssize_t ma_used; // 8 bytes
uint64_t ma_version_tag; // 8 bytes
PyDictKeysObject *ma_keys; // 8 bytes
PyDictValues *ma_values; // 8 bytes
} PyDictObject; // total 48 bytes
struct _dictkeysobject {
Py_ssize_t dk_refcnt; // 8 bytes
uint8_t dk_log2_size; // 1 byte
uint8_t dk_log2_index_bytes; // 1 byte
uint8_t dk_kind; // 1 byte
uint32_t dk_version; // 4 bytes
Py_ssize_t dk_usable; // 8 bytes
Py_ssize_t dk_nentries; // 8 bytes
char dk_indices[]; // flexible array (not counted here)
// dk_entries follows dk_indices
}; // sizeof(PyDictKeysObject) == 32 bytes (including alignment padding)The total memory of a non‑empty dictionary is:
total = sizeof(PyDictObject) + sizeof(PyDictKeysObject) +
size_of(dk_indices) + size_of(dk_entries)For a hash‑table capacity of 8:
Index array: 8 bytes (1 byte per slot).
Usable entry slots: USABLE_FRACTION(8) = 5.
If all keys are strings ( PyDictUnicodeEntry, 16 bytes each): 48 + 32 + 8 + 5 × 16 = 168 bytes.
Otherwise generic entries ( PyDictKeyEntry, 24 bytes each): 48 + 32 + 8 + 5 × 24 = 208 bytes.
When the dictionary grows to hold 7 entries, the hash‑table expands to capacity 16 (next power of two). The usable entry slots become USABLE_FRACTION(16) = 10:
All‑string keys: 48 + 32 + 16 + 10 × 16 = 256 bytes.
Generic keys: 48 + 32 + 16 + 10 × 24 = 336 bytes.
These formulas match the empirical measurements shown in the original article.
Capacity rules
Dictionary capacity is the length of the dk_indices array and is always a power of two. The minimum capacity is defined by:
#define PyDict_LOG_MINSIZE 3
#define PyDict_MINSIZE (1 << PyDict_LOG_MINSIZE) // 8When the number of used entries reaches dk_usable (i.e., 2/3 of the capacity), the table is resized to the next power‑of‑two size.
Key take‑aways
CPython dictionaries are hash tables that achieve O(1) lookup and insertion by storing keys, values, and (for non‑string keys) cached hashes.
Since Python 3.6 the entry array preserves insertion order, giving dictionaries deterministic iteration order.
Memory is saved by splitting the hash table into a compact index array ( dk_indices) and a reduced‑size entry array ( dk_entries), avoiding the 1/3 waste of a single‑array design.
PEP‑0412 introduced split tables so that dictionaries belonging to instances of the same class can share a single PyDictKeysObject, further reducing memory when attribute names are identical.
When all keys are Unicode strings (Python 3.12+), PyDictUnicodeEntry omits the cached hash because the string object already stores it, saving 8 bytes per entry.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Satori Komeiji's Programming Classroom
Python and Rust developer; I write about any topics you're interested in. Follow me! (#^.^#)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
