
Why Oracle’s AUTO_SAMPLE_SIZE Misses Skewed Data and How to Fix It

The article explains how Oracle’s default DBMS_STATS.AUTO_SAMPLE_SIZE can produce wildly inaccurate cardinality estimates on highly skewed columns, examines the NewDensity and OldDensity algorithms, demonstrates the problem with real‑world examples, and offers practical solutions such as forcing estimate_percent or disabling NewDensity.

dbaplus Community

Oct 24, 2016


Background

Since Oracle 11g, Oracle has recommended gathering statistics with DBMS_STATS.AUTO_SAMPLE_SIZE rather than a manually chosen ESTIMATE_PERCENT. This works well in most cases, but under extreme data skew the automatically chosen sample can be too small to capture rare values, which leads to inaccurate histograms and poor execution plans.

Case Study

A customer reported that a query on USERSSA1.LINEDetail with a predicate on GROUPNO used a full table scan despite an index on that column. The optimizer estimated 2.5 million rows for the predicate GROUPNO = '0000260455', while only five rows actually matched.

Analysis of the table’s statistics showed a frequency histogram with a single bucket (NUM_BUCKETS = 1), even though the column statistics recorded 17 distinct values. The column consists mostly of NULLs, with a small number of non‑NULL values.

The optimizer’s estimate was derived from the NewDensity factor (0.5) multiplied by the total row count, yielding the inflated estimate.

NewDensity vs. OldDensity

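Two density factors are recorded in the 10053 trace: NewDensity and OldDensity. OldDensity is calculated as 0.5 / num_rows, which gives a fixed estimate of 0.5 rows for any non‑popular value; the optimizer rounds this to 1 in the plan. NewDensity was introduced in Oracle 10.2.0.4 (and is present in 11g) to improve estimates for values not present in the frequency histogram. For frequency histograms its formula is commonly given as:

NewDensity = 0.5 * bkt(least popular value) / num_rows

Here bkt(least popular value) is the number of rows in the histogram bucket of the least frequent value. However, when only one bucket exists, that bucket is also the least popular one and holds every sampled row, so the calculation simplifies to NewDensity = 0.5. The optimizer then assumes that any non‑popular value matches half the rows, which defeats the purpose of the algorithm.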
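
The arithmetic behind the two density factors can be sketched in a few lines of Python. The NewDensity expression used here (half the frequency of the least popular histogram value) is the form commonly described for frequency histograms and should be read as an assumption rather than as Oracle's exact implementation. The function names and the 5,000,000-row figure are illustrative; the row count is chosen only so that half the rows matches the 2.5 million estimate from the case study.

```python
from typing import Sequence


def old_density(num_rows: int) -> float:
    """OldDensity as described above: 0.5 / num_rows."""
    return 0.5 / num_rows


def new_density(bucket_counts: Sequence[int]) -> float:
    """Assumed NewDensity form: 0.5 * (rows of least popular value) / (all histogram rows)."""
    return 0.5 * min(bucket_counts) / sum(bucket_counts)


def estimated_rows(density: float, num_rows: int) -> int:
    """Cardinality for a value missing from the histogram, never reported below 1."""
    return max(1, round(density * num_rows))


num_rows = 5_000_000        # illustrative row count, not taken from the article
single_bucket = [num_rows]  # the sample saw only one distinct value

print(estimated_rows(old_density(num_rows), num_rows))       # OldDensity: 1 row
print(estimated_rows(new_density(single_bucket), num_rows))  # NewDensity: half the rows
```

With more than one bucket, for example `new_density([10, 1000])`, the factor drops well below 0.5, which is the behavior the algorithm was designed for.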

Problems with AUTO_SAMPLE_SIZE

With AUTO_SAMPLE_SIZE, the sample used for a column with many duplicate values, such as GROUPNO, can be as low as 0.1 %. DBMS_STATS counts the distinct values correctly through its approximate NDV algorithm, but the histogram built from the small sample still contains a single bucket, which leads to poor cardinality estimates.

Attempts to increase the bucket count by requesting 254 buckets (for example through the SIZE clause of METHOD_OPT) still produced a single bucket after re‑gathering statistics, and the plan remained a full table scan.

Why Buckets Can Be Incorrect

Extreme data skew causes the sampling process to miss non‑popular values.

Columns with long identical prefixes (>32 bytes) are treated as identical samples, reducing bucket diversity.

Test 1 – Extreme Skew

Data was generated with a heavy skew where most rows contain NULLs and only a few rows contain a specific value. The histogram collected no samples for the rare value, resulting in an estimated row count 500,000 times larger than reality.
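
A rough calculation shows why a small sample almost always misses such rare values. The sketch below assumes each row is sampled independently with the same probability, which simplifies however Oracle actually samples; the function names are hypothetical, and the figures (5 matching rows, a 0.1 % sample) are the ones quoted in this article.

```python
def miss_probability(rare_rows: int, sample_fraction: float) -> float:
    """Chance that an independent per-row sample picks none of the rare rows."""
    return (1.0 - sample_fraction) ** rare_rows


def fraction_needed(rare_rows: int, miss_target: float) -> float:
    """Sample fraction at which the chance of missing every rare row equals miss_target."""
    return 1.0 - miss_target ** (1.0 / rare_rows)


# Five matching rows and a 0.1 % sample, as in the case study.
p_miss = miss_probability(rare_rows=5, sample_fraction=0.001)
print(f"chance the sample contains none of the 5 rows: {p_miss:.3f}")  # 0.995

# Fraction needed for an even chance of seeing at least one of them.
print(f"fraction for a 50 % chance: {fraction_needed(5, 0.5):.3f}")
```

Under this simplified model the rare value is absent from the sample about 99.5 % of the time, so it is unsurprising that the histogram ends up with a single bucket.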

Test 2 – Long Identical Prefixes

When column values share a long common prefix, the endpoint values in the histogram are truncated to their first 32 bytes. Values that differ only after that point become indistinguishable, so the histogram is inaccurate and the optimizer severely misestimates cardinalities.
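
The effect of the truncation can be illustrated without a database. The sketch below models the histogram as seeing only the first 32 bytes of each value; the prefix string and the helper name are hypothetical, and the real endpoint encoding in Oracle is more involved than a plain byte slice.

```python
PREFIX_BYTES = 32  # limit mentioned in this article


def histogram_key(value: str, limit: int = PREFIX_BYTES) -> bytes:
    """Keep only the leading bytes that the histogram endpoint can distinguish."""
    return value.encode("utf-8")[:limit]


# Hypothetical values that share a prefix longer than 32 bytes.
prefix = "ORDER-2016-REGION-EAST-WAREHOUSE-01-"
values = [prefix + f"{i:06d}" for i in range(1000)]

distinct_actual = len(set(values))
distinct_seen = len({histogram_key(v) for v in values})

print(distinct_actual, distinct_seen)  # 1000 1
```

A thousand distinct values collapse into one histogram key, which is the same single-bucket situation that extreme skew produces.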

Solution 1 – Manually Set ESTIMATE_PERCENT

Setting ESTIMATE_PERCENT = 100 forces a full scan of the table during statistics collection. After re‑gathering, the histogram contained 18 buckets, and the optimizer correctly estimated five rows for the predicate, using the index IDX_LINEDETAIL_GROUPNO.

For very large tables, a lower percentage can be used, but the key is to ensure sufficient sampling for skewed columns.

To make this permanent for the affected table while leaving AUTO_SAMPLE_SIZE as the default elsewhere, ESTIMATE_PERCENT can be stored as a table preference with DBMS_STATS.SET_TABLE_PREFS.
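
A minimal sketch of setting that preference from Python, assuming a DB-API cursor such as the one provided by the python-oracledb driver. DBMS_STATS.SET_TABLE_PREFS takes the owner, table name, preference name, and preference value; the helper name `pin_estimate_percent` is hypothetical, and the owner and table in the usage comment are the ones from the case study.

```python
def pin_estimate_percent(cursor, owner: str, table: str, percent: str = "100") -> None:
    """Store ESTIMATE_PERCENT as a table-level DBMS_STATS preference.

    `cursor` is assumed to be a DB-API cursor connected to the database
    with sufficient privileges on the table's statistics.
    """
    cursor.callproc(
        "DBMS_STATS.SET_TABLE_PREFS",
        [owner, table, "ESTIMATE_PERCENT", percent],
    )


# Usage with a real connection (not executed here):
#   pin_estimate_percent(connection.cursor(), "USERSSA1", "LINEDETAIL")
```

Later statistics gathers on the table that do not pass an explicit ESTIMATE_PERCENT should then pick up the stored value, while other tables keep the default.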

Solution 2 – Use OldDensity Algorithm

The NewDensity algorithm can be disabled via the _fix_control parameter, reverting to the older OldDensity calculation. This can be set globally with an ALTER SYSTEM command.

Prevention

DBAs can query DBA_TAB_COL_STATISTICS for columns that have a histogram but report NUM_BUCKETS = 1 (or another suspiciously low value) to identify potential problem areas before they affect query performance.
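
Such a check might look like the sketch below. NUM_BUCKETS and HISTOGRAM are columns of DBA_TAB_COL_STATISTICS, so the query targets that view. The filter on frequency histograms, the helper name, and the default threshold are assumptions for illustration, and the cursor is assumed to be a DB-API cursor that accepts named bind variables (as python-oracledb does).

```python
SINGLE_BUCKET_SQL = """
SELECT owner, table_name, column_name, num_distinct, num_buckets, sample_size
  FROM dba_tab_col_statistics
 WHERE histogram = 'FREQUENCY'
   AND num_buckets <= :max_buckets
   AND owner = :owner
 ORDER BY table_name, column_name
"""


def suspicious_columns(cursor, owner: str, max_buckets: int = 1):
    """Return histogram columns of one schema whose bucket count is suspiciously low."""
    cursor.execute(SINGLE_BUCKET_SQL, {"max_buckets": max_buckets, "owner": owner})
    return cursor.fetchall()


# Usage with a real connection (not executed here):
#   for row in suspicious_columns(connection.cursor(), "USERSSA1"):
#       print(row)
```

Comparing NUM_DISTINCT with NUM_BUCKETS in the result is a quick way to spot columns like GROUPNO, where the histogram holds far fewer buckets than there are distinct values.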

Conclusion

Although Oracle’s optimizer and automatic statistics collection have become smarter, extreme data skew still requires manual intervention. By adjusting sampling percentages, disabling NewDensity, or customizing table preferences, DBAs can ensure accurate statistics and optimal execution plans.
