Inside the Oracle Optimizer - Removing the black magic: January 2008

Optimizer statistics in Oracle are managed via a pl/sql package, dbms_stats. It provides several pl/sql procedures to gather statistics for a table, schema, or a database. For example, gather_table_statistics is used to gather statistics on a table. This procedure has an estimate_percent parameter, which specifies the sampling percentage of the statistics gathering. The users can specify any number between 0 ~ 100 for this parameter. For example, suppose you have a table BIGT, you can specify a 1% sampling percentage as follows:

exec dbms_stats.gather_table_stats(null, 'BIGT', 
                       estimate_percent => 1);

It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, it can take longer to gather statistics. On the contray, if the data is skewed and the specified sampling percentage is too low, the resulting statistics can be inaccurate.

For this reason, Oracle introduced the AUTO value for the estimate_percent parameter. For example, you can gather statistics on BIGT as follows:

exec dbms_stats.gather_table_stats(null, 'BIGT', 
                       estimate_percent => dbms_stats.auto_sample_size);

The advantage of using AUTO sample size over a fixed number is two-folds. First, when AUTO sample size is specified, the system automatically determines the appropriate sampling percentage. Second, AUTO sample size is more flexible than a fixed sampling percentage. A fixed sampling percentage size that was good at some point in time may not be appropriate after the data distribution in the table has changed. On the other hand when AUTO value is used Oracle will adjust the sample size when the data distribution changes.

When AUTO is used Oracle picks a sample size where the statistics quality is good enough. However, it does not work very well under cases of extreme skew in the data. In Oracle 11g, we improved the behavior when the AUTO value is used. First, AUTO sampling now generates deterministic statistics. Second, and more importantly, AUTO sampling generates statistics that are almost as accurate as 100% sampling but takes much less time than 100% sampling. To illustrate these merits, we compare the performance of using a fixed sampling percentage, AUTO sample size in Oracle 10g and AUTO sample size in Oracle 11g.

We used the standard TPC-D data generator to generate a Lineitem table. The Lineitem table is about 230G and contains 1.8 million rows with 16 columns. The schema of the lineitem table is as follows:

column name	column type
l_shipdate	date
l_orderkey	number
l_discount	number
l_extendedprice	number
l_suppkey	number
l_quantity	number
l_returnflag	varchar2
l_partkey	number
l_linestatus	varchar2
l_tax	number
l_commitdate	date
l_receiptdate	date
l_shipmode	varchar2
l_linenumber	number
l_shipinstruct	varchar2
l_comment	varchar2

Table 1 gives the elapsed time of gathering statistics on the Lineitem table by different sampling percentages.

Sampling Percentage	Elapsed Time (sec)
1% sampling	797
100% sampling (Compute)	18772
Auto sampling in Oracle 10g	2935
Auto sampling in Oracle 11g	1908

Table 1: Statistics gathering time on 230G TPC-D Lineitem Table Using Different Estimate Percentages

We also compare the quality of the statistics gathered using different estimate percentages. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue. We define the accuracy rate of NDV of a column as follows:


accuracy rate = 1 - (|estimated NDV - actual NDV|)/actual NDV.

The accuracy rate ranges from 0% to 100%. The higher the accuracy rate is, the more accurate the gathered statistics are. Since 100% sampling always lead to an accuracy rate of 100%, we do not report it. We focus on the columns which has at least one statistics accuracy rate below 99.9% when using different estimate percentages. Table 2 illustrates the accurate rates of the columns.

Column Name	Actual NDV	Auto Sampling in Oracle 11g	1% Sampling
orderkey	450,000,000	98.0%	50%
comment	181,122,127	98.60%	4.60%
partkey	60,000,000	99.20%	98.20%
suppkey	3,000,000	99.60%	99.90%
extendedprice	3,791,320	99.60%	94.30%

Table 2: Accuracy Rate of Gathering NDV LineItem Using Different Estimate Percentages

In short, the elapsed time of ''AUTO sampling in Oracle 11g'' is 10 times faster than 100% sampling but yields the statistics of similar quality (accuracy rate close to 100%).

Inside the Oracle Optimizer - Removing the black magic

Got a question for the Optimizer team?

Tuesday, January 22, 2008

Improvement of AUTO sampling statistics gathering feature in Oracle 11g

Disclaimer

Upcoming events & presentations

Categories

Blog Archive

About Me