Fundamentals 12 min read

Comparing Python List and Set Lookup Performance and Their Underlying Implementations

This article benchmarks Python list versus set lookups on large datasets, explains why sets are dramatically faster, and delves into the internal C‑level implementations of list as a dynamic array and set/dict as hash tables, including resizing rules and collision handling.

Python Programming Learning Circle

Jan 24, 2022

Comparing Python List and Set Lookup Performance and Their Underlying Implementations

The article investigates the performance difference between searching large data sets with Python list and set, showing that sets provide far superior lookup speed.

import time
import random
nums = [random.randint(0, 2000000) for i in range(1000)]
list_test = list(range(1000000))
set_test = set(list_test)
count_list, count_set = 0, 0
t1 = time.time()  # test list lookup
for num in nums:
    if num in list_test:
        count_list += 1
t2 = time.time()
for num in nums:  # test set lookup
    if num in set_test:
        count_set += 1
t3 = time.time()
print('找到个数，列表：{}，集合：{}'.format(count_list, count_set))
print('使用时间，列表：{:.4f}s'.format(t2 - t1))
print('使用时间，集合：{:.4f}s'.format(t3 - t2))

The output demonstrates that the list lookup takes about 7.9 seconds while the set lookup completes in roughly 0.001 seconds, highlighting the efficiency gap.

Python list is implemented as a dynamic array of pointers. The underlying C structure is:

typedef struct {
  PyObject_VAR_HEAD
  PyObject **ob_item;
  Py_ssize_t allocated;
} PyListObject;

Each element is a pointer to a Python object, and the array resizes according to a specific rule when it becomes full:

new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6)
new_allocated += newsize

This strategy doubles capacity when needed and halves it when the usage falls below half, but frequent insert‑delete cycles can cause excessive copying.

For homogeneous data, the article suggests using the array module or numpy for more compact memory layouts and faster access.

Python set (and dict) are built on hash tables. Keys are hashed to an index, and collisions are resolved using open addressing with pseudo‑random probing. The entry states are:

Unused : both key and value are NULL.

Active : key is non‑NULL and not a dummy, value is non‑NULL.

Dummy : key is a special dummy object, value is NULL; used to preserve probe chains after deletions.

Open addressing ensures that even after deletions the probe sequence remains intact, preventing lookup failures.

Compared to lists (O(n) lookup), sets and dicts achieve average O(1) lookup, insertion, and deletion, provided the load factor stays below about 0.75. Python automatically resizes the hash table when the load factor reaches two‑thirds.

Overall, the article explains why sets are dramatically faster for membership tests, how Python’s internal data structures work, and offers practical tips for optimizing performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Python Data Structures List hash table Set dynamic array

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.