exec dbms_stats.gather_table_stats(null, 'BIGT',

estimate_percent => 1);

It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, it can take longer to gather statistics. On the contray, if the data is skewed and the specified sampling percentage is too low, the resulting statistics can be inaccurate.

For this reason, Oracle introduced the AUTO value for the estimate_percent parameter. For example, you can gather statistics on BIGT as follows:

exec dbms_stats.gather_table_stats(null, 'BIGT',

estimate_percent => dbms_stats.auto_sample_size);

When AUTO is used Oracle picks a sample size where the statistics quality is good enough. However, it does not work very well under cases of extreme skew in the data. In Oracle 11g, we improved the behavior when the AUTO value is used. First, AUTO sampling now generates deterministic statistics. Second, and more importantly, AUTO sampling generates statistics that are almost as accurate as 100% sampling but takes much less time than 100% sampling. To illustrate these merits, we compare the performance of using a fixed sampling percentage, AUTO sample size in Oracle 10g and AUTO sample size in Oracle 11g.

We used the standard TPC-D data generator to generate a Lineitem table. The Lineitem table is about 230G and contains 1.8 million rows with 16 columns. The schema of the lineitem table is as follows:

column name | column type |

l_shipdate | date |

l_orderkey | number |

l_discount | number |

l_extendedprice | number |

l_suppkey | number |

l_quantity | number |

l_returnflag | varchar2 |

l_partkey | number |

l_linestatus | varchar2 |

l_tax | number |

l_commitdate | date |

l_receiptdate | date |

l_shipmode | varchar2 |

l_linenumber | number |

l_shipinstruct | varchar2 |

l_comment | varchar2 |

Table 1 gives the elapsed time of gathering statistics on the Lineitem table by different sampling percentages.

Sampling Percentage | Elapsed Time (sec) |

1% sampling | 797 |

100% sampling (Compute) | 18772 |

Auto sampling in Oracle 10g | 2935 |

Auto sampling in Oracle 11g | 1908 |

Table 1: Statistics gathering time on 230G TPC-D Lineitem Table Using Different Estimate Percentages

We also compare the quality of the statistics gathered using different estimate percentages. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue. We define the accuracy rate of NDV of a column as follows:

accuracy rate = 1 - (|estimated NDV - actual NDV|)/actual NDV.

The accuracy rate ranges from 0% to 100%. The higher the accuracy rate is, the more accurate the gathered statistics are. Since 100% sampling always lead to an accuracy rate of 100%, we do not report it. We focus on the columns which has at least one statistics accuracy rate below 99.9% when using different estimate percentages. Table 2 illustrates the accurate rates of the columns.

Column Name | Actual NDV | Auto Sampling in Oracle 11g | 1% Sampling |

orderkey | 450,000,000 | 98.0% | 50% |

comment | 181,122,127 | 98.60% | 4.60% |

partkey | 60,000,000 | 99.20% | 98.20% |

suppkey | 3,000,000 | 99.60% | 99.90% |

extendedprice | 3,791,320 | 99.60% | 94.30% |

Table 2: Accuracy Rate of Gathering NDV LineItem Using Different Estimate Percentages

In short, the elapsed time of ''AUTO sampling in Oracle 11g'' is 10 times faster than 100% sampling but yields the statistics of similar quality (accuracy rate close to 100%).

## 5 comments:

This is fantastic information. We use Oracle 9i currently (10g shortly) and find the AUTO sample size undesirable in our current implementation.

What we have found in our Data Warehouse is that 90-95% accuracy is more desirable than near 100% accuracy if there is a significant speed increase in gathering stats.

I would hope in a future version of Oracle that perhaps we could set a target accuracy as a parameter. I wonder what difference that would make in the speed of gathering stats.

Regardless, we would look forward to playing with the new enhancements in 11g. Thanks!

And what was the accuracy of the 10g AUTO sampling?

Pretty interesting stuff! However, I am missing few bits of information:

What does "deterministic statistics" mean? In how far were statistics non deterministic in older versions of Oracle? Does this refer to the selection of rows for analysis?

A column "Auto sampling in Oracle 10g" in table 2 would be great so we can see the difference in accuracy between Oracle 10 and 11 auto sampling.

And a hint about what changed in the sampling process would be great in order to help DBA's decide when they can safely rely on auto sampling and when they cannot.

Just my 0.02 EUR.

I second Robert's questions; especially I would like to know a bit more about the autosampling algorithm, for the reasons Robert expressed but also out of sheer curiosity.

Good Information. Please keep it coming.

Post a Comment