Understanding PostgreSQL Cost Estimation: Statistics, Physical Paths, and Optimization Techniques
This article uses a dialogue‑driven story to explain how PostgreSQL gathers statistics, computes selection rates, evaluates various physical scan and join paths, and combines CPU, I/O, and communication costs—including startup versus total cost—to choose the most efficient execution plan.
Author Background
Zhang Shujie, author of PostgreSQL Technical Insights: Deep Dive into Query Optimization and a kernel developer at Pivotal, presents the material as a conversational story to make complex optimizer concepts easier to grasp.
Statistics and Selection Rate
The optimizer relies heavily on table statistics to estimate the cost of physical operators. Accurate statistics describe data distribution, enabling the planner to predict how many rows survive a predicate. The selection rate is the fraction of rows filtered out by a condition, calculated from statistics.
Typical statistics include distinct‑value rate, NULL‑value rate, most‑common values, histograms, and column correlation. Distinct‑value rate tells how many unique values a column has; NULL‑value rate isolates NULLs; high‑frequency values capture outliers; histograms model value distribution; correlation helps estimate the cost of index scans.
Physical Scan and Join Paths
Physical paths are the leaf nodes of an execution plan. Scan paths include:
Sequential Scan – reads the whole table; simple but often costly.
Index Scan – uses an index to locate qualifying tuples, reducing I/O but introducing random reads.
Bitmap Scan – builds a bitmap of qualifying tuple locations, then reads them in order, mitigating random‑read overhead.
TID Scan – directly fetches tuples by their physical address, offering very fast access.
Join paths include:
Nested‑Loop Join – simple O(m·n) algorithm; can be accelerated when the inner side uses an Index Scan (O(m·log n)).
Hash Join – builds a hash table on the inner side (cost ≈ O(m·n/N) where N is the number of hash buckets).
Merge Join – requires sorted inputs; sorting cost can be avoided if an ordered index is available.
Cost Units
PostgreSQL defines a base I/O cost of 1 for a sequential page read/write and treats a random page read as 4 × the sequential cost. CPU cost covers tuple extraction, expression evaluation, and other processing. Parallel execution adds a communication cost. The total cost is the sum of these components: Total Cost = CPU Cost + I/O Cost + Communication Cost Cost is also split into:
Startup Cost – the expense incurred before the first row is returned.
Total Cost – the expense to produce all rows.
Choosing a plan based solely on total cost can be misleading when a query has a LIMIT clause; a lower startup cost may be preferable.
Example Plans and Startup vs. Total Cost
For SELECT * FROM TEST_A WHERE a > 1 ORDER BY a LIMIT 1; two possible plans are:
SeqScan → Sort → Limit (higher startup cost, lower total cost).
IndexScan → Limit (lower startup cost, higher total cost due to random I/O).
The optimizer must weigh startup cost against total cost, especially when a LIMIT is present.
Path Search Methods
PostgreSQL uses a bottom‑up dynamic programming approach to explore join orders. It builds “layers” of intermediate join results, reusing optimal sub‑plans to avoid exhaustive enumeration. Diagrams in the original story illustrate first‑level tables, second‑level joins, and subsequent layers leading to the final optimal path.
Because the search space grows quickly with many tables, PostgreSQL also employs a stochastic method based on a genetic algorithm. The algorithm proceeds as follows:
Population Initialization : Encode possible plans as chromosomes and compute their fitness (plan cost).
Selection : Probabilistically choose chromosomes for reproduction.
Crossover : Combine parts of two chromosomes to create offspring.
Mutation : Randomly alter chromosomes to maintain diversity.
Fitness Evaluation : Discard low‑fitness plans.
In PostgreSQL’s implementation, mutation is omitted; new plans are generated only via crossover, but the overall idea remains the same.
Conclusion
The story demonstrates how statistics, selection rates, physical path types, cost modeling, and search algorithms together enable PostgreSQL to choose efficient execution plans. Readers are encouraged to consult the author’s book for deeper implementation details.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
