Comparing Python List and Set Lookup Performance and Their Underlying Implementations
This article benchmarks Python list versus set lookups on large datasets, explains why sets are dramatically faster, and delves into the internal C‑level implementations of list as a dynamic array and set/dict as hash tables, including resizing rules and collision handling.
The article investigates the performance difference between searching large data sets with Python list and set , showing that sets provide far superior lookup speed.
<code>import time
import random
nums = [random.randint(0, 2000000) for i in range(1000)]
list_test = list(range(1000000))
set_test = set(list_test)
count_list, count_set = 0, 0
t1 = time.time() # test list lookup
for num in nums:
if num in list_test:
count_list += 1
t2 = time.time()
for num in nums: # test set lookup
if num in set_test:
count_set += 1
t3 = time.time()
print('找到个数,列表:{},集合:{}'.format(count_list, count_set))
print('使用时间,列表:{:.4f}s'.format(t2 - t1))
print('使用时间,集合:{:.4f}s'.format(t3 - t2))
</code>The output demonstrates that the list lookup takes about 7.9 seconds while the set lookup completes in roughly 0.001 seconds, highlighting the efficiency gap.
Python list is implemented as a dynamic array of pointers. The underlying C structure is:
<code>typedef struct {
PyObject_VAR_HEAD
PyObject **ob_item;
Py_ssize_t allocated;
} PyListObject;
</code>Each element is a pointer to a Python object, and the array resizes according to a specific rule when it becomes full:
<code>new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6)
new_allocated += newsize
</code>This strategy doubles capacity when needed and halves it when the usage falls below half, but frequent insert‑delete cycles can cause excessive copying.
For homogeneous data, the article suggests using the array module or numpy for more compact memory layouts and faster access.
Python set (and dict ) are built on hash tables. Keys are hashed to an index, and collisions are resolved using open addressing with pseudo‑random probing. The entry states are:
Unused : both key and value are NULL .
Active : key is non‑NULL and not a dummy, value is non‑NULL.
Dummy : key is a special dummy object, value is NULL ; used to preserve probe chains after deletions.
Open addressing ensures that even after deletions the probe sequence remains intact, preventing lookup failures.
Compared to lists (O(n) lookup), sets and dicts achieve average O(1) lookup, insertion, and deletion, provided the load factor stays below about 0.75. Python automatically resizes the hash table when the load factor reaches two‑thirds.
Overall, the article explains why sets are dramatically faster for membership tests, how Python’s internal data structures work, and offers practical tips for optimizing performance.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.