Why Oracle’s AUTO_SAMPLE_SIZE Misses Skewed Data and How to Fix It
The article explains how Oracle’s default DBMS_STATS.AUTO_SAMPLE_SIZE can produce wildly inaccurate cardinality estimates on highly skewed columns, examines the NewDensity and OldDensity algorithms, demonstrates the problem with real‑world examples, and offers practical solutions such as forcing estimate_percent or disabling NewDensity.
Background
Since Oracle 11g, Oracle has recommended leaving ESTIMATE_PERCENT at DBMS_STATS.AUTO_SAMPLE_SIZE rather than setting it manually. This works well in most cases, but under extreme data skew the automatically chosen sample can be too small, which produces inaccurate histograms and poor execution plans.
Case Study
A customer reported that a query on USERSSA1.LINEDetail with a predicate on GROUPNO used a full table scan despite an index on that column. The optimizer estimated 2.5 million rows for the predicate GROUPNO = '0000260455', while only five rows actually matched.
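A gap like this between estimated and actual rows can be made visible with row-source statistics. The sketch below is not the customer's statement (the source does not show it); the COUNT(*) query is a stand-in that reuses only the table, column, and literal named above.

```sql
-- Stand-in query (assumption): only the predicate comes from the case study.
-- The hint collects actual row counts for this one execution.
SELECT /*+ GATHER_PLAN_STATISTICS */ COUNT(*)
FROM   USERSSA1.LINEDetail
WHERE  GROUPNO = '0000260455';

-- Display the plan of the last statement run in this session.
-- E-Rows is the optimizer's estimate, A-Rows is what actually came back.
-- In SQL*Plus, SET SERVEROUTPUT OFF first, or the "last statement"
-- will be the DBMS_OUTPUT call instead of the query above.
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
```

A large ratio between E-Rows and A-Rows on the line that applies the GROUPNO predicate points at the column statistics rather than at the index.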
Analysis of the table's statistics showed a frequency histogram with NUM_BUCKETS = 1, although the column statistics recorded 17 distinct values. The column is mostly NULL, with only a few non-NULL values.
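The column statistics and histogram endpoints described here can be read from the data dictionary. This is a minimal sketch, assuming the table is stored under the uppercase name LINEDETAIL (the usual result of an unquoted identifier).

```sql
-- Column-level statistics: distinct values, NULLs, bucket count, histogram type.
SELECT column_name, num_distinct, num_nulls, num_buckets,
       histogram, density, sample_size, last_analyzed
FROM   dba_tab_col_statistics
WHERE  owner       = 'USERSSA1'
AND    table_name  = 'LINEDETAIL'   -- assumed uppercase dictionary name
AND    column_name = 'GROUPNO';

-- Histogram endpoints: one row per bucket of a frequency histogram.
SELECT endpoint_number, endpoint_value, endpoint_actual_value
FROM   dba_tab_histograms
WHERE  owner       = 'USERSSA1'
AND    table_name  = 'LINEDETAIL'
AND    column_name = 'GROUPNO'
ORDER  BY endpoint_number;
```

If NUM_DISTINCT is well above NUM_BUCKETS for a FREQUENCY histogram, the histogram has missed values that exist in the column.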
The optimizer’s estimate was derived from the NewDensity factor (0.5) multiplied by the total row count, yielding the inflated estimate.
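One way to confirm which density the optimizer applied is to capture a 10053 trace for the statement. The sketch below uses the classic event syntax; the trace file identifier is an arbitrary label chosen for illustration, and the SELECT is again a stand-in for the customer's query.

```sql
-- Tag the trace file so it is easy to find (label is arbitrary).
ALTER SESSION SET tracefile_identifier = 'groupno_10053';

-- Turn on optimizer tracing for this session.
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

-- EXPLAIN PLAN forces the statement to be optimized, so a trace is written.
EXPLAIN PLAN FOR
SELECT *
FROM   USERSSA1.LINEDetail
WHERE  GROUPNO = '0000260455';

-- Turn tracing off again.
ALTER SESSION SET EVENTS '10053 trace name context off';
```

Searching the resulting trace file for NewDensity and OldDensity in the single-table statistics section for GROUPNO shows the values the estimate was built from.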
NewDensity vs. OldDensity
Two density factors are recorded in the 10053 trace: NewDensity and OldDensity. OldDensity is calculated as 0.5 / num_rows, so the estimate for any non-popular value is a fixed 0.5 rows, which the optimizer rounds up to 1 in the plan. NewDensity was introduced in Oracle 10.2.0.4 to improve estimates for values not present in the frequency histogram. As usually described for a frequency histogram, it is half the frequency of the least popular value captured in the histogram, expressed as a fraction of the rows the histogram covers.
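In symbols, the two densities can be written as follows. This is the form in which the frequency-histogram calculation is commonly described, not a formula reproduced from the source; the symbol bkt_min is notation introduced here.

```latex
% num_rows : rows covered by the histogram (non-NULL values)
% bkt_min  : row count of the least popular value present in the histogram
\[
\text{OldDensity} = \frac{0.5}{\text{num\_rows}},
\qquad
\text{NewDensity} = \frac{0.5 \times \text{bkt}_{\min}}{\text{num\_rows}}
\]
% Estimated rows for a value that is not in the histogram:
\[
\text{cardinality} \approx \text{density} \times \text{num\_rows}
\]
```

With OldDensity the estimate is always 0.5 rows, while with NewDensity it is half the row count of the rarest value the histogram happened to capture.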
However, when only one bucket exists, the calculation simplifies to NewDensity = 0.5, causing the optimizer to assume half the rows for any non‑popular value, which defeats the purpose of the algorithm.
Problems with AUTO_SAMPLE_SIZE
The default sampling can be as low as 0.1% for columns with many duplicate values, such as GROUPNO. Even though statistics gathering uses the APPROXIMATE_NDV algorithm and counts the distinct values correctly, the histogram is built from the small sample and still contains a single bucket, leading to poor cardinality estimates.
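How small the sample was can be estimated by comparing the column's SAMPLE_SIZE with the number of non-NULL rows. This is a rough sketch: how SAMPLE_SIZE is populated for a column with a histogram varies between versions, so the percentage is an indication rather than an exact figure.

```sql
SELECT c.column_name,
       t.num_rows,
       c.num_nulls,
       c.sample_size,
       -- Share of non-NULL rows that the column statistics were based on.
       ROUND(100 * c.sample_size
             / NULLIF(t.num_rows - c.num_nulls, 0), 4) AS sample_pct_non_null,
       c.num_distinct,
       c.num_buckets,
       c.histogram
FROM   dba_tab_col_statistics c
JOIN   dba_tables t
       ON  t.owner      = c.owner
       AND t.table_name = c.table_name
WHERE  c.owner       = 'USERSSA1'
AND    c.table_name  = 'LINEDETAIL'   -- assumed uppercase dictionary name
AND    c.column_name = 'GROUPNO';
```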
Requesting more buckets (SIZE 254 in METHOD_OPT) did not help: after re-gathering statistics the histogram still had a single bucket, and the plan remained a full table scan.
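A gather call of the kind described might look like the sketch below. The exact call used in the case study is not shown in the source, so the parameter choices other than the 254-bucket request are assumptions.

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'USERSSA1',
    tabname          => 'LINEDETAIL',                   -- assumed uppercase name
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,    -- still the default sample
    method_opt       => 'FOR COLUMNS GROUPNO SIZE 254', -- ask for up to 254 buckets
    cascade          => TRUE);                          -- also refresh index stats
END;
/
```

SIZE 254 is only an upper limit on the bucket count. If the sample contains a single distinct value, there is only one bucket to build, which matches the outcome reported above.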
Why Buckets Can Be Incorrect
Extreme data skew causes the sampling process to miss non‑popular values.
Columns with long identical prefixes (>32 bytes) are treated as identical samples, reducing bucket diversity.
Test 1 – Extreme Skew
Data was generated with a heavy skew where most rows contain NULLs and only a few rows contain a specific value. The histogram collected no samples for the rare value, resulting in an estimated row count 500,000 times larger than reality.
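A test along these lines can be reproduced with a small synthetic table. The table name, index name, values, and row counts below are all hypothetical; the source does not give the script it used.

```sql
-- 1,000,000 rows: 5 rows hold a rare value, 10% hold one popular value,
-- and the rest are NULL. All names and values are made up for the test.
CREATE TABLE skew_demo AS
SELECT ROWNUM AS id,
       CASE
         WHEN ROWNUM <= 5         THEN 'RARE000001'
         WHEN MOD(ROWNUM, 10) = 0 THEN 'POPULAR001'
         ELSE NULL
       END AS groupno
FROM   (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 1000) a,
       (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 1000) b;

CREATE INDEX skew_demo_groupno_ix ON skew_demo (groupno);

-- Gather with the default sample size and request a histogram on GROUPNO.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'SKEW_DEMO',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    method_opt       => 'FOR COLUMNS GROUPNO SIZE 254');
END;
/

-- Check whether the rare value made it into the histogram.
SELECT endpoint_number, endpoint_value, endpoint_actual_value
FROM   user_tab_histograms
WHERE  table_name  = 'SKEW_DEMO'
AND    column_name = 'GROUPNO'
ORDER  BY endpoint_number;
```

Whether the rare value is missed depends on the Oracle version and on the sample drawn, so the result should be checked rather than assumed.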
Test 2 – Long Identical Prefixes
When column values share a long common prefix, the endpoint values in the histogram are truncated to the first 32 bytes, so distinct values collapse into the same endpoint. The histogram then no longer reflects the real distribution, and the optimizer severely misestimates cardinalities.
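A quick way to see whether a column is exposed to this problem is to compare its distinct count with the distinct count of its first 32 bytes. The table and column names below are hypothetical placeholders, and the 32-byte limit is taken from the text above.

```sql
-- Hypothetical names: replace my_table and long_code with the real ones.
-- Note that this scans the whole table.
SELECT COUNT(DISTINCT long_code)                 AS ndv_full,
       COUNT(DISTINCT SUBSTRB(long_code, 1, 32)) AS ndv_first_32_bytes
FROM   my_table;
```

If ndv_first_32_bytes is much smaller than ndv_full, many values are indistinguishable within the truncated prefix and a histogram on that column is unlikely to separate them.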
Solution 1 – Manually Set ESTIMATE_PERCENT
Setting ESTIMATE_PERCENT = 100 forces a full scan of the table during statistics collection. After re‑gathering, the histogram contained 18 buckets, and the optimizer correctly estimated five rows for the predicate, using the index IDX_LINEDETAIL_GROUPNO.
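A full-sample gather for this table could be written as in the sketch below. The METHOD_OPT and NO_INVALIDATE settings are assumptions added for illustration; the source states only that ESTIMATE_PERCENT was set to 100.

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'USERSSA1',
    tabname          => 'LINEDETAIL',                   -- assumed uppercase name
    estimate_percent => 100,                            -- read every row
    method_opt       => 'FOR COLUMNS GROUPNO SIZE 254', -- assumption, see lead-in
    cascade          => TRUE,                           -- refresh index stats too
    no_invalidate    => FALSE);                         -- invalidate cursors now
END;
/
```

With NO_INVALIDATE set to FALSE, dependent cursors are invalidated immediately, so the next execution of the query is re-optimized against the new histogram.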
For very large tables, a lower percentage can be used, but the key is to ensure sufficient sampling for skewed columns.
To make this permanent for one table while leaving AUTO_SAMPLE_SIZE as the default everywhere else, set the ESTIMATE_PERCENT table preference with DBMS_STATS.SET_TABLE_PREFS.
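A table preference of this kind can be set and verified as sketched below, assuming the table is stored under the uppercase name LINEDETAIL.

```sql
-- Store ESTIMATE_PERCENT = 100 as a preference for this one table.
BEGIN
  DBMS_STATS.SET_TABLE_PREFS(
    ownname => 'USERSSA1',
    tabname => 'LINEDETAIL',
    pname   => 'ESTIMATE_PERCENT',
    pvalue  => '100');
END;
/

-- Confirm the value that later gather calls will pick up.
SELECT DBMS_STATS.GET_PREFS('ESTIMATE_PERCENT', 'USERSSA1', 'LINEDETAIL')
       AS estimate_percent
FROM   dual;
```

The preference applies to gather calls that do not pass ESTIMATE_PERCENT themselves, which includes the automatic statistics job. A call that passes the parameter explicitly overrides it.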
Solution 2 – Use OldDensity Algorithm
The NewDensity algorithm can be disabled via the _fix_control parameter, reverting to the older OldDensity calculation. This can be set globally with an ALTER SYSTEM command.
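The fix control number is not given in the text above. The sketch below assumes it is 5483301, the number commonly cited for the frequency-histogram NewDensity change. Check it against V$SYSTEM_FIX_CONTROL on the target version, and involve Oracle Support before changing a hidden parameter system-wide.

```sql
-- Check that the fix control exists and what it currently does.
-- 5483301 is an assumption (see lead-in); verify the description first.
SELECT bugno, value, description
FROM   v$system_fix_control
WHERE  bugno = 5483301;

-- Try it in one session first and re-check the plan.
ALTER SESSION SET "_fix_control" = '5483301:off';

-- System-wide setting, as described above. SCOPE = BOTH requires an spfile.
ALTER SYSTEM SET "_fix_control" = '5483301:off' SCOPE = BOTH;
```

Testing at session level first limits the change to one connection, which makes it easy to compare plans before and after.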
Prevention
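DBAs can query DBA_TAB_COL_STATISTICS (the view that exposes NUM_BUCKETS; DBA_TAB_HISTOGRAMS holds only the individual endpoints) for columns with a histogram where NUM_BUCKETS = 1 or is otherwise low, to identify potential problem areas before they affect query performance.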
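A screening query along these lines is sketched below. The filter on NUM_DISTINCT greater than NUM_BUCKETS is an assumption about what "suspicious" means: it flags frequency histograms that hold fewer buckets than the column has distinct values, which includes the single-bucket case.

```sql
SELECT owner, table_name, column_name,
       num_distinct, num_buckets, num_nulls, sample_size, last_analyzed
FROM   dba_tab_col_statistics
WHERE  histogram    = 'FREQUENCY'
AND    num_distinct > num_buckets          -- histogram missed some values
AND    owner NOT IN ('SYS', 'SYSTEM')      -- skip dictionary schemas
ORDER  BY num_distinct - num_buckets DESC, owner, table_name, column_name;
```

Columns returned by this query are candidates for a larger sample or a table-level ESTIMATE_PERCENT preference.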
Conclusion
Although Oracle’s optimizer and automatic statistics collection have become smarter, extreme data skew still requires manual intervention. By adjusting sampling percentages, disabling NewDensity, or customizing table preferences, DBAs can ensure accurate statistics and optimal execution plans.
dbaplus Community