How Python Sets Work Under the Hood: Implementation Details and API
This article explains Python's set type, covering its typical use cases, core API methods, and a deep dive into the CPython C implementation—including the PySetObject layout, hash table mechanics, field behavior, and differences from dictionaries.
Python's set behaves like a dictionary without values; it is essentially a hash‑table‑based collection used for deduplication, membership testing, and checking whether a sequence contains multiple items. Simple examples show how converting a list to a set removes duplicates and how set operations simplify common tasks.
Set API Overview
Key operations include add (O(1) insertion), remove (raises KeyError if absent), discard (no error), pop (random removal, O(1)), and clear. Set algebra is performed with operators: & (intersection), | (union), - (difference), ^ (symmetric difference), and equality checks. Creating an empty set requires set(), because {} denotes an empty dictionary.
Internal C Structure
The CPython implementation stores a set in PySetObject (see Include/cpython/setobject.h):
typedef struct {
PyObject_HEAD
Py_ssize_t fill; // active + dummy entries
Py_ssize_t used; // active entries only (ob_size)
Py_ssize_t mask; // table size - 1, used for index masking
setentry *table; // pointer to the entry array
Py_hash_t hash; // hash of the set (for frozenset)
Py_ssize_t finger; // used by pop()
setentry smalltable[PySet_MINSIZE]; // inline storage for up to 8 items
PyObject *weakreflist;
} PySetObject;Each setentry holds a key pointer and its pre‑computed hash:
typedef struct {
PyObject *key;
Py_hash_t hash;
} setentry;The mask field enables fast index calculation via hash & mask, which is equivalent to hash % size when the table size is a power of two. This design avoids costly modulo operations.
Demonstrating the Layout with ctypes
Using ctypes we can inspect a live set object. For a set {3.14, "abc", 666} the three elements reside in smalltable at indices 2, 3, and 6, as shown by printing each entry's stored hash. The fill and used fields both start at 3 because all entries are active.
from ctypes import *
class PyObject(Structure):
_fields_ = [("ob_refcnt", c_ssize_t), ("ob_type", c_void_p)]
class SetEntry(Structure):
_fields_ = [("key", POINTER(PyObject)), ("hash", c_longlong)]
class PySetObject(PyObject):
_fields_ = [
("fill", c_ssize_t),
("used", c_ssize_t),
("mask", c_ssize_t),
("table", POINTER(SetEntry)),
("hash", c_long),
("finger", c_ssize_t),
("smalltable", (SetEntry * 8)),
("weakreflist", POINTER(PyObject)),
]
s = {3.14, "abc", 666}
py_set_obj = PySetObject.from_address(id(s))
for index, entry in enumerate(py_set_obj.smalltable):
print(index, entry.hash)After removing all elements with repeated pop(), fill stays at 3 (active + dummy entries) while used drops to 0, illustrating the dummy‑entry mechanism that preserves probe chains.
Creation Process
The set constructor ultimately calls PySet_New (see Objects/setobject.c), which allocates a PySetObject with an initial capacity of 8, sets fill, used, and mask, points table to smalltable, and then inserts each element from the provided iterable via set_update_internal. The process is straightforward and mirrors dictionary creation, but without a value array.
PyObject *PySet_New(PyObject *iterable) {
return make_new_set(&PySet_Type, iterable);
}
static PyObject *make_new_set(PyTypeObject *type, PyObject *iterable) {
PySetObject *so = (PySetObject *)type->tp_alloc(type, 0);
if (so == NULL) return NULL;
so->fill = 0;
so->used = 0;
so->mask = PySet_MINSIZE - 1;
so->table = so->smalltable;
so->hash = -1;
so->finger = 0;
so->weakreflist = NULL;
if (iterable != NULL) {
if (set_update_internal(so, iterable)) {
Py_DECREF(so);
return NULL;
}
}
return (PyObject *)so;
}Dictionary vs. Set Hash Tables
Both structures use hash tables, but a dictionary stores key and value in each entry and maintains a separate index array, effectively using two arrays. A set stores only the key (the element) in a single array, which simplifies the implementation at the cost of higher memory usage because the table must remain sparse.
When an entry is deleted, dictionaries write a sentinel integer -2 (DKIX_DUMMY) into the index array, while sets write a pointer to a static dummy struct (
static PyObject _dummy_struct; #define dummy (&_dummy_struct)). In both cases the dummy marks the slot as deleted without breaking the probing chain, allowing later insertions to overwrite the dummy slot.
Key Takeaways
The set implementation is a trimmed‑down version of the dict implementation: it reuses the same hash‑table machinery, omits the value storage, and therefore has fewer fields and no dedicated cache pool. This makes the code path shorter and easier to understand, which is reflected in the detailed C‑level walkthrough above.
Future articles will explore higher‑level set operations and performance considerations.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Satori Komeiji's Programming Classroom
Python and Rust developer; I write about any topics you're interested in. Follow me! (#^.^#)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
