Fundamentals 18 min read

How Python Sets Work Under the Hood: Implementation Details and API

This article explains Python's set type, covering its typical use cases, core API methods, and a deep dive into the CPython C implementation—including the PySetObject layout, hash table mechanics, field behavior, and differences from dictionaries.

Satori Komeiji's Programming Classroom
Satori Komeiji's Programming Classroom
Satori Komeiji's Programming Classroom
How Python Sets Work Under the Hood: Implementation Details and API

Python's set behaves like a dictionary without values; it is essentially a hash‑table‑based collection used for deduplication, membership testing, and checking whether a sequence contains multiple items. Simple examples show how converting a list to a set removes duplicates and how set operations simplify common tasks.

Set API Overview

Key operations include add (O(1) insertion), remove (raises KeyError if absent), discard (no error), pop (random removal, O(1)), and clear. Set algebra is performed with operators: & (intersection), | (union), - (difference), ^ (symmetric difference), and equality checks. Creating an empty set requires set(), because {} denotes an empty dictionary.

Internal C Structure

The CPython implementation stores a set in PySetObject (see Include/cpython/setobject.h):

typedef struct {
    PyObject_HEAD
    Py_ssize_t fill;      // active + dummy entries
    Py_ssize_t used;      // active entries only (ob_size)
    Py_ssize_t mask;      // table size - 1, used for index masking
    setentry *table;      // pointer to the entry array
    Py_hash_t hash;        // hash of the set (for frozenset)
    Py_ssize_t finger;   // used by pop()
    setentry smalltable[PySet_MINSIZE]; // inline storage for up to 8 items
    PyObject *weakreflist;
} PySetObject;

Each setentry holds a key pointer and its pre‑computed hash:

typedef struct {
    PyObject *key;
    Py_hash_t hash;
} setentry;

The mask field enables fast index calculation via hash & mask, which is equivalent to hash % size when the table size is a power of two. This design avoids costly modulo operations.

Demonstrating the Layout with ctypes

Using ctypes we can inspect a live set object. For a set {3.14, "abc", 666} the three elements reside in smalltable at indices 2, 3, and 6, as shown by printing each entry's stored hash. The fill and used fields both start at 3 because all entries are active.

from ctypes import *

class PyObject(Structure):
    _fields_ = [("ob_refcnt", c_ssize_t), ("ob_type", c_void_p)]

class SetEntry(Structure):
    _fields_ = [("key", POINTER(PyObject)), ("hash", c_longlong)]

class PySetObject(PyObject):
    _fields_ = [
        ("fill", c_ssize_t),
        ("used", c_ssize_t),
        ("mask", c_ssize_t),
        ("table", POINTER(SetEntry)),
        ("hash", c_long),
        ("finger", c_ssize_t),
        ("smalltable", (SetEntry * 8)),
        ("weakreflist", POINTER(PyObject)),
    ]

s = {3.14, "abc", 666}
py_set_obj = PySetObject.from_address(id(s))
for index, entry in enumerate(py_set_obj.smalltable):
    print(index, entry.hash)

After removing all elements with repeated pop(), fill stays at 3 (active + dummy entries) while used drops to 0, illustrating the dummy‑entry mechanism that preserves probe chains.

Creation Process

The set constructor ultimately calls PySet_New (see Objects/setobject.c), which allocates a PySetObject with an initial capacity of 8, sets fill, used, and mask, points table to smalltable, and then inserts each element from the provided iterable via set_update_internal. The process is straightforward and mirrors dictionary creation, but without a value array.

PyObject *PySet_New(PyObject *iterable) {
    return make_new_set(&PySet_Type, iterable);
}

static PyObject *make_new_set(PyTypeObject *type, PyObject *iterable) {
    PySetObject *so = (PySetObject *)type->tp_alloc(type, 0);
    if (so == NULL) return NULL;
    so->fill = 0;
    so->used = 0;
    so->mask = PySet_MINSIZE - 1;
    so->table = so->smalltable;
    so->hash = -1;
    so->finger = 0;
    so->weakreflist = NULL;
    if (iterable != NULL) {
        if (set_update_internal(so, iterable)) {
            Py_DECREF(so);
            return NULL;
        }
    }
    return (PyObject *)so;
}

Dictionary vs. Set Hash Tables

Both structures use hash tables, but a dictionary stores key and value in each entry and maintains a separate index array, effectively using two arrays. A set stores only the key (the element) in a single array, which simplifies the implementation at the cost of higher memory usage because the table must remain sparse.

When an entry is deleted, dictionaries write a sentinel integer -2 (DKIX_DUMMY) into the index array, while sets write a pointer to a static dummy struct (

static PyObject _dummy_struct; #define dummy (&_dummy_struct)

). In both cases the dummy marks the slot as deleted without breaking the probing chain, allowing later insertions to overwrite the dummy slot.

Key Takeaways

The set implementation is a trimmed‑down version of the dict implementation: it reuses the same hash‑table machinery, omits the value storage, and therefore has fewer fields and no dedicated cache pool. This makes the code path shorter and easier to understand, which is reflected in the detailed C‑level walkthrough above.

Future articles will explore higher‑level set operations and performance considerations.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

memory-managementdata structureshash tablesetCPythonctypes
Satori Komeiji's Programming Classroom
Written by

Satori Komeiji's Programming Classroom

Python and Rust developer; I write about any topics you're interested in. Follow me! (#^.^#)

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.