Inside CPython’s Garbage Collector: Ref Counting, Cycle Detection & Generational Tricks
CPython’s garbage collector combines reference counting with a cyclic collector that uses generational strategies, fat pointers, and optimized object structures to efficiently identify and reclaim unreachable objects, handling cycles, weak references, and memory layout details while minimizing overhead.
Abstract
Almost all high‑level languages have built‑in garbage collection; Python is no exception. The official Python development guide describes the garbage‑collector algorithm in detail; this article is a translation.
Reference Counting
CPython primarily uses reference counting. Each object stores a count of how many references point to it; when the count drops to zero the object is deallocated. The sys.getrefcount function returns the reference count (the returned value is one higher because the function itself holds a temporary reference).
>> x = object()
>>> sys.getrefcount(x)
2
>>> y = x
>>> sys.getrefcount(x)
3
>>> del y
>>> sys.getrefcount(x)
2Reference counting cannot handle reference cycles. Example of a cyclic reference:
>> container = []
>>> container.append(container)
>>> sys.getrefcount(container)
3
>>> del containerIn this example the list container contains a reference to itself, so removing one external reference does not reduce the count to zero; the object remains alive. An additional cycle‑collector is required to break such cycles.
Memory Layout and Object Structure
Typical Python objects are represented in C as:
object -----> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ \
| ob_refcnt |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyObject_HEAD
| *ob_type |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
| ... |To support garbage collection, extra fields are added before the usual object header:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ \
| *_gc_next |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyGC_Head
| *_gc_prev |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
| ob_refcnt |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyObject_HEAD
| *ob_type |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
| ... |The extra fields _gc_next and _gc_prev form a doubly‑linked list of GC‑managed objects. The list can be traversed to find unreachable objects.
Cycle Detection
The gc module implements the cycle‑detection algorithm. It scans container objects (lists, dicts, custom class instances, etc.) and builds two doubly‑linked lists: one for objects to be scanned and one for temporarily unreachable objects. Typical cyclic patterns include exception objects referencing tracebacks, module‑level functions referencing module dictionaries, class instances referencing their classes, and graph‑like data structures.
>> import gc
>>> class Link:
... def __init__(self, next_link=None):
... self.next_link = next_link
>>> link_3 = Link()
>>> link_2 = Link(link_3)
>>> link_1 = Link(link_2)
>>> link_3.next_link = link_1
>>> A = link_1
>>> del link_1, link_2, link_3
>>> link_4 = Link()
>>> link_4.next_link = link_4
>>> del link_4
>>> gc.collect()
2During a collection run the GC first decrements a temporary gc_ref field for each reference; objects whose gc_ref becomes zero are moved to the “unreachable” list. A second pass performs a breadth‑first search to confirm which objects are truly unreachable before they are finally reclaimed.
Why Move Unreachable Objects?
Moving only unreachable objects reduces the number of pointer updates because most objects are reachable. After the first collection the objects are ordered opposite to their creation order, which speeds up later collections.
Destroying Unreachable Objects
Process and clear weak references; invoke callbacks only when the weak‑ref itself is reachable.
If the object has a legacy finalizer ( tp_del), move it to gc.garbage.
Call the object’s finalizer ( tp_finalize) and mark it as finalized.
Handle objects that become resurrected during finalization.
For each remaining unreachable object, call tp_clear to break cycles and let reference‑count based destruction run.
Optimization: Generational Collection
CPython groups container objects into three generations. New objects start in generation 0; objects that survive a collection are promoted to the next generation. Each generation has its own threshold; when the net number of new objects exceeds threshold_0 a collection of generation 0 is triggered, and similarly for higher generations. Full collection of generation 2 occurs only when the ratio of long‑lived pending objects exceeds 25 %.
>> import gc
>>> gc.get_threshold()
(700, 10, 10)Optimization: Fat Pointers
The two pointer fields in PyGC_Head are “fat” pointers: low bits are repurposed to store flags such as PREV_MASK_COLLECTING, _PyGC_PREV_MASK_FINALIZED, and NEXT_MASK_UNREACHABLE. During collection the low bits temporarily hold the temporary gc_ref value.
Optimization: Delayed Management of Container Objects
Containers that cannot participate in cycles (e.g., tuples containing only immutable objects) are not tracked by the GC. The GC checks at creation time and during collection whether a container can be untracked. Functions like gc.is_tracked(obj) report the current tracking status.
>> gc.is_tracked([])
True
>>> gc.is_tracked({})
False
>>> gc.is_tracked({"a": []})
TrueThese optimizations reduce the overhead of the collector while preserving correctness.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
