Fundamentals 24 min read

How Python Implements Dictionaries: Inside the Underlying Structure

This article dissects Python's dictionary implementation, explaining the PyDictObject layout, the combined and split hash‑table designs, memory‑optimisation techniques, insertion and lookup processes, ordered iteration, and precise memory‑size calculations with concrete code examples.

Satori Komeiji's Programming Classroom
Satori Komeiji's Programming Classroom
Satori Komeiji's Programming Classroom
How Python Implements Dictionaries: Inside the Underlying Structure

CPython dictionary object layout

In CPython a dictionary is represented by the PyDictObject structure:

// Include/cpython/dictobject.h
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;            // number of entries (dictionary length)
    uint64_t ma_version_tag;       // changes on any modification (used by iterators)
    PyDictKeysObject *ma_keys;     // pointer to the keys object
    PyDictValues *ma_values;       // pointer to the values object (NULL for combined tables)
} PyDictObject;

The ma_keys pointer always points to a PyDictKeysObject. If the dictionary uses a combined table the ma_values pointer is NULL; if a split table is used both pointers are non‑NULL.

Early CPython used only combined tables. The split‑table design was added by PEP‑0412 to let multiple dictionaries share the same set of keys, reducing memory when many objects have identical attribute names.

from ctypes import *

class PyObject(Structure):
    _fields_ = [("ob_refcnt", c_ssize_t), ("ob_type", c_void_p)]

class PyDictObject(PyObject):
    _fields_ = [
        ("ma_used", c_ssize_t),
        ("ma_version_tag", c_uint64),
        ("ma_keys", c_void_p),
        ("ma_values", c_void_p),
    ]

d = {"a": 1, "b": 2}
print(PyDictObject.from_address(id(d)).ma_values)  # None → combined table

class A: pass

a1 = A()
a2 = A()
print(PyDictObject.from_address(id(a1.__dict__)).ma_values,
      PyDictObject.from_address(id(a2.__dict__)).ma_values)  # non‑None → split table
print(PyDictObject.from_address(id(a1.__dict__)).ma_keys,
      PyDictObject.from_address(id(a2.__dict__)).ma_keys)  # identical keys pointer

Key‑set object ( PyDictKeysObject )

The core of the hash table lives in PyDictKeysObject. Its definition (simplified) is:

// Include/cpython/dictobject.h
typedef enum {
    DICT_KEYS_GENERAL = 0,   // arbitrary key types
    DICT_KEYS_UNICODE = 1,   // all keys are Unicode strings
    DICT_KEYS_SPLIT   = 2    // split table (keys shared)
} DictKeysKind;

// Include/internal/pycore_dict.h
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;               // reference count of the key set
    uint8_t dk_log2_size;               // log2 of the hash‑table size (power‑of‑2)
    uint8_t dk_log2_index_bytes;        // log2 of the index‑array byte size
    uint8_t dk_kind;                    // one of DictKeysKind values
    uint32_t dk_version;                // changes when the key set changes
    Py_ssize_t dk_usable;               // how many entries can still be added
    Py_ssize_t dk_nentries;             // current number of entries
    char dk_indices[];                  // flexible array: hash‑index slots (store indices into dk_entries)
    // dk_entries follows dk_indices in memory; each entry is either PyDictKeyEntry or PyDictUnicodeEntry
};

Entry structures

Each entry stores a key, a value, and (for non‑string keys) the pre‑computed hash:

// Include/internal/pycore_dict.h
typedef struct {
    Py_hash_t me_hash;   // cached hash of the key
    PyObject *me_key;    // pointer to the key object
    PyObject *me_value;  // pointer to the value object
} PyDictKeyEntry;

typedef struct {
    PyObject *me_key;    // Unicode strings already carry a hash
    PyObject *me_value;  // pointer to the value object
} PyDictUnicodeEntry; // used from Python 3.12 when all keys are strings

Insertion algorithm (example key "komeiji")

Store a new PyDictKeyEntry in dk_entries at the next free index (0 for an empty dict).

Compute the hash of the key, map it to an index in the hash‑index array (e.g., 6).

Write the entry index (0) into dk_indices[6].

Lookup algorithm

Lookup repeats the hash computation, reads the index from dk_indices, fetches the entry from dk_entries, and compares the stored key with the query key. All steps are O(1), which explains the constant‑time performance of dictionary operations.

Iteration order (Python 3.6+)

Since entries are appended to dk_entries in insertion order, iterating simply walks this array. The order is an implementation detail and should not be relied upon for program logic.

Memory optimisation – splitting the hash table

Early CPython stored both the hash‑index and the entries in a single array, which required a 1/3 waste because the usable fraction of a hash table must stay below 2/3 to keep the load factor low. The macro used to compute the usable limit is:

// Objects/dictobject.c
#define USABLE_FRACTION(n) (((n) << 1) / 3)

For a table of length 8, at most 5 entries can be stored (8 × 2/3 ≈ 5). CPython therefore allocates only 2/3 of the space for dk_entries and stores the mapping indices in a separate dk_indices array (1 byte per slot for capacities ≤ 255, 2 bytes for capacities ≤ 65535, etc.). This two‑array design saves memory because an entry occupies 24 bytes (generic) or 16 bytes (all‑string) while an index slot occupies only 1–2 bytes.

Memory size calculations

Sizes of the fixed parts (ignoring the flexible arrays):

typedef struct {
    PyObject_HEAD                 // 16 bytes
    Py_ssize_t ma_used;           // 8 bytes
    uint64_t ma_version_tag;      // 8 bytes
    PyDictKeysObject *ma_keys;     // 8 bytes
    PyDictValues *ma_values;      // 8 bytes
} PyDictObject;                    // total 48 bytes

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;          // 8 bytes
    uint8_t dk_log2_size;          // 1 byte
    uint8_t dk_log2_index_bytes;   // 1 byte
    uint8_t dk_kind;               // 1 byte
    uint32_t dk_version;           // 4 bytes
    Py_ssize_t dk_usable;          // 8 bytes
    Py_ssize_t dk_nentries;        // 8 bytes
    char dk_indices[];             // flexible array (not counted here)
    // dk_entries follows dk_indices
}; // sizeof(PyDictKeysObject) == 32 bytes (including alignment padding)

The total memory of a non‑empty dictionary is:

total = sizeof(PyDictObject) + sizeof(PyDictKeysObject) +
        size_of(dk_indices) + size_of(dk_entries)

For a hash‑table capacity of 8:

Index array: 8 bytes (1 byte per slot).

Usable entry slots: USABLE_FRACTION(8) = 5.

If all keys are strings ( PyDictUnicodeEntry, 16 bytes each): 48 + 32 + 8 + 5 × 16 = 168 bytes.

Otherwise generic entries ( PyDictKeyEntry, 24 bytes each): 48 + 32 + 8 + 5 × 24 = 208 bytes.

When the dictionary grows to hold 7 entries, the hash‑table expands to capacity 16 (next power of two). The usable entry slots become USABLE_FRACTION(16) = 10:

All‑string keys: 48 + 32 + 16 + 10 × 16 = 256 bytes.

Generic keys: 48 + 32 + 16 + 10 × 24 = 336 bytes.

These formulas match the empirical measurements shown in the original article.

Capacity rules

Dictionary capacity is the length of the dk_indices array and is always a power of two. The minimum capacity is defined by:

#define PyDict_LOG_MINSIZE 3
#define PyDict_MINSIZE (1 << PyDict_LOG_MINSIZE)   // 8

When the number of used entries reaches dk_usable (i.e., 2/3 of the capacity), the table is resized to the next power‑of‑two size.

Key take‑aways

CPython dictionaries are hash tables that achieve O(1) lookup and insertion by storing keys, values, and (for non‑string keys) cached hashes.

Since Python 3.6 the entry array preserves insertion order, giving dictionaries deterministic iteration order.

Memory is saved by splitting the hash table into a compact index array ( dk_indices) and a reduced‑size entry array ( dk_entries), avoiding the 1/3 waste of a single‑array design.

PEP‑0412 introduced split tables so that dictionaries belonging to instances of the same class can share a single PyDictKeysObject, further reducing memory when attribute names are identical.

When all keys are Unicode strings (Python 3.12+), PyDictUnicodeEntry omits the cached hash because the string object already stores it, saving 8 bytes per entry.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Pythonmemory optimizationdata structureshash tabledictCPython
Satori Komeiji's Programming Classroom
Written by

Satori Komeiji's Programming Classroom

Python and Rust developer; I write about any topics you're interested in. Follow me! (#^.^#)

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.