How Python’s Set Operations Are Implemented Under the Hood
This article dissects Python’s built‑in set type, explaining how add, pop, remove, copy, update, union and other set operations map to low‑level C functions, how the hash table handles collisions, dummy entries, resizing, and the finger optimization that speeds up element removal.
Python’s built‑in set type is implemented in C (file Objects/setobject.c) as an open‑addressing hash table. The article walks through each public method— add, pop, remove / discard, copy, update, union / intersection —and shows the exact C functions they invoke, the hash‑value handling, probing strategy, and resizing logic.
Add method: adding an element
Calling set.add(x) invokes set_add, which first calls set_add_key to compute the element’s hash (using the cached hash for strings or PyObject_Hash otherwise). The core insertion happens in set_add_entry:
static PyObject *set_add(PySetObject *so, PyObject *key) {
if (set_add_key(so, key))
return NULL;
Py_RETURN_NONE;
}
static int set_add_entry(PySetObject *so, PyObject *key, Py_hash_t hash) {
setentry *table;
setentry *freeslot;
setentry *entry;
size_t perturb;
size_t mask;
size_t i;
int probes;
int cmp;
Py_INCREF(key);
restart:
mask = so->mask;
i = (size_t)hash & mask;
freeslot = NULL;
perturb = hash;
while (1) {
entry = &so->table[i];
probes = (i + LINEAR_PROBES <= mask) ? LINEAR_PROBES : 0;
do {
if (entry->hash == 0 && entry->key == NULL)
goto found_unused_or_dummy;
if (entry->hash == hash) {
PyObject *startkey = entry->key;
assert(startkey != dummy);
if (startkey == key)
goto found_active;
if (PyUnicode_CheckExact(startkey) && PyUnicode_CheckExact(key) && _PyUnicode_EQ(startkey, key))
goto found_active;
table = so->table;
Py_INCREF(startkey);
cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
Py_DECREF(startkey);
if (cmp > 0)
goto found_active;
if (cmp < 0)
goto comparison_error;
if (table != so->table || entry->key != startkey)
goto restart;
mask = so->mask;
}
else if (entry->hash == -1) {
assert(entry->key == dummy);
freeslot = entry;
}
entry++;
} while (probes--);
perturb >>= PERTURB_SHIFT;
i = (i * 5 + 1 + perturb) & mask;
}
found_unused_or_dummy:
if (freeslot == NULL)
goto found_unused;
so->used++;
freeslot->key = key;
freeslot->hash = hash;
return 0;
found_unused:
so->fill++;
so->used++;
entry->key = key;
entry->hash = hash;
if ((size_t)so->fill * 5 < mask * 3)
return 0;
return set_table_resize(so, so->used > 50000 ? so->used * 2 : so->used * 4);
found_active:
Py_DECREF(key);
return 0;
comparison_error:
Py_DECREF(key);
return -1;
}The algorithm mirrors dict insertion: compute hash & mask to obtain an index, then probe linearly (up to LINEAR_PROBES) and, if necessary, use a secondary hash perturbation. Unused slots ( hash==0 && key==NULL) are filled directly; dummy slots ( hash==-1 && key==dummy) are remembered for possible reuse. When the number of active + dummy entries exceeds three‑fifths of the table size, the table is resized (doubling if used>50000, otherwise quadrupling).
Pop method: removing an arbitrary element
set.pop()calls set_pop. It uses the finger field to remember where the previous pop left off, scans forward until it finds an active entry, returns its key, and then marks the slot as dummy ( key = dummy; hash = -1) and updates used and finger:
static PyObject *set_pop(PySetObject *so, PyObject *Py_UNUSED(ignored)) {
setentry *entry = so->table + (so->finger & so->mask);
setentry *limit = so->table + so->mask;
PyObject *key;
if (so->used == 0) {
PyErr_SetString(PyExc_KeyError, "pop from an empty set");
return NULL;
}
while (entry->key == NULL || entry->key == dummy) {
entry++;
if (entry > limit)
entry = so->table;
}
key = entry->key;
entry->key = dummy;
entry->hash = -1;
so->used--;
so->finger = entry - so->table + 1; /* next start position */
return key;
}A ctypes demo shows that after each pop the finger advances, avoiding a full scan on subsequent pops.
Remove / Discard methods: deleting a specific element
set.remove(x)maps to set_remove, which delegates to set_discard_key. If the key is not found, set_remove raises KeyError; discard silently returns. When the key is an unhashable container (e.g., a list or another set), the implementation creates a temporary frozenset and retries, clearing the original TypeError.
static PyObject *set_remove(PySetObject *so, PyObject *key) {
PyObject *tmpkey;
int rv;
rv = set_discard_key(so, key);
if (rv < 0) {
if (!PySet_Check(key) || !PyErr_ExceptionMatches(PyExc_TypeError))
return NULL;
PyErr_Clear();
tmpkey = make_new_set(&PyFrozenSet_Type, key);
if (tmpkey == NULL)
return NULL;
rv = set_discard_key(so, tmpkey);
Py_DECREF(tmpkey);
if (rv < 0)
return NULL;
}
if (rv == DISCARD_NOTFOUND) {
_PyErr_SetKeyError(key);
return NULL;
}
Py_RETURN_NONE;
}
static int set_discard_key(PySetObject *so, PyObject *key) {
setentry *entry = set_lookkey(so, key, hash);
if (entry == NULL)
return -1;
if (entry->key == NULL)
return DISCARD_NOTFOUND;
PyObject *old_key = entry->key;
entry->key = dummy;
entry->hash = -1;
so->used--;
Py_DECREF(old_key);
return DISCARD_FOUND;
}Examples demonstrate that removing a list raises TypeError, while removing a frozenset works and raises KeyError only when the element is absent.
Copy method: duplicating a set
set.copy()creates a fresh set object via make_new_set_basetype and then calls set_update_internal to copy all entries:
static PyObject *set_copy(PySetObject *so, PyObject *Py_UNUSED(ignored)) {
return make_new_set_basetype(Py_TYPE(so), (PyObject *)so);
}
static PyObject *make_new_set_basetype(PyTypeObject *type, PyObject *iterable) {
if (type != &PySet_Type && type != &PyFrozenSet_Type) {
if (PyType_IsSubtype(type, &PySet_Type))
type = &PySet_Type;
else
type = &PyFrozenSet_Type;
}
return make_new_set(type, iterable);
}The new set starts with capacity 8, then set_update_internal iterates over the source set (or any iterable) and inserts each element using the same set_add_key / set_add_entry path.
Update method: merging iterables in‑place
set.update(iter1, iter2, …)loops over the tuple of arguments and calls set_update_internal for each. The helper handles three cases:
If the argument is another set (or frozenset), it calls set_merge for a fast bulk copy.
If the argument is a dict, it treats the dict’s keys as elements, optionally resizing beforehand.
Otherwise it obtains an iterator and adds each yielded element via set_add_key.
static PyObject *set_update(PySetObject *so, PyObject *args) {
Py_ssize_t i;
for (i = 0; i < PyTuple_GET_SIZE(args); i++) {
PyObject *other = PyTuple_GET_ITEM(args, i);
if (set_update_internal(so, other))
return NULL;
}
Py_RETURN_NONE;
}Union and Intersection
The article shows the implementation of set.intersection as an example of a binary operation that creates a new set. It first checks for identity, then creates an empty result set, optionally swaps the operands to iterate over the smaller one, and for each element uses set_contains_entry to test membership before inserting via set_add_entry. The same pattern underlies union, difference, and symmetric_difference.
static PyObject *set_intersection(PySetObject *so, PyObject *other) {
PySetObject *result;
if ((PyObject *)so == other)
return set_copy(so, NULL);
result = (PySetObject *)make_new_set_basetype(Py_TYPE(so), NULL);
if (result == NULL)
return NULL;
if (PyAnySet_Check(other)) {
// iterate over the smaller set
...
if (set_contains_entry(so, key, hash))
set_add_entry(result, key, hash);
...
} else {
// generic iterable path
...
}
return (PyObject *)result;
}Other set operations
The article lists the remaining built‑in methods and their operator equivalents: s1 & s2 – intersection s1 | s2 – union s1 - s2 – difference s1 ^ s2 – symmetric difference s1 == s2 – equality test s1 >= s2 / s1 <= s2 – subset/superset checks s1 > s2 / s1 < s2 – proper subset/superset
When using the operator forms, both operands must be set objects; the method forms accept any iterable.
Summary
By exposing the actual C source, the article reveals that Python’s set operations rely on a hash table with linear probing, a special dummy tombstone to support deletions, and a finger pointer that makes successive pop calls O(1) on average. Resizing decisions are based on the ratio of active + dummy entries to the table mask, and the implementation reuses existing code paths for dicts and generic iterables, demonstrating the language’s emphasis on code reuse and performance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Satori Komeiji's Programming Classroom
Python and Rust developer; I write about any topics you're interested in. Follow me! (#^.^#)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
