Which Histogram Will Oracle Pick?
Have you ever wondered how does Oracle determine what type of histograms to create?
As of 12c there are 4 types of histograms. Are you familiar with these types? How does Oracle decide which ones to choose out of the four?
- frequency
- top frequency
- height balanced
- hybrid
If you need an introduction to histograms, you can check out last week’s post: Intro To Histograms. Three Basic Questions Answered.
Let me first introduce you to each type of histogram, so you have a better understanding.
Frequency Histograms
If the histogram on your column is a frequency histogram, this means that there is a bucket for each distinct value of the column.
Here is an example:
You have the column COLOR, with 4 distinct values: red, blue, green, yellow. Let’s say there are 100 rows in the table that have the color red, 50 rows in the table that have the color blue, 10 rows in the table that have the color green, and 2 rows in the table that have the color yellow.
In this case, there will be 4 buckets in the histogram (one for each distinct value), red bucket with 100 values, blue bucket with 50 values, green bucket with 10 values, and the yellow bucket with 2 values.
For frequency histograms, each bucket can have a different number of values in it, which means some buckets could be more full than others.
Top Frequency Histograms
These type of histograms were introduced with 12c, and are similar to the frequency histograms. The difference is that the top frequency histograms ignore non-popular values that are statistically insignificant.
Let’s look at an example.
We have the column CITY, which has 300 distinct values for table A. This table A has 100,000 rows.
There are 20,000 rows with the column CITY as Toronto, 30,000 rows with the column CITY as Vancouver, 35,000 rows with the column CITY as Calgary. The rest of the 15,000 rows are divided between the rest of the 297 distinct values. However there are a few city values, such as Airdrie and Kelowna with 5 values each.
These 2 values will be considered non-popular values, and will be ignored (there will be no bucket for them).
Height-Balanced Histograms
The Height-Balanced Histograms are legacy type histograms. The column values are divided into buckets, and each bucket contains approximately the same number of rows. These type of histograms are not very “popular” in 12c and up.
Hybrid Histograms
These types of histograms were also introduced with 12c. They are a combination of height-balanced histograms and frequency histograms.
Oracle says it is the “best of both worlds”.
If you want to check whether or not a histogram exists on the columns of a table, query DBA_TAB_COL_STATISTICS.
SQL> select COLUMN_NAME,NUM_DISTINCT,NUM_BUCKETS,HISTOGRAM from DBA_TAB_COL_STATISTICS where table_name='A' and owner='DR' order by column_name; COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM ---------------- ------------ ----------- --------------- COL_1 674 254 HYBRID COL_2 5 5 FREQUENCY COL_3 8 1 NONE COL_4 1348 1 NONE COL_5 1361 254 HYBRID COL_6 1346 254 HYBRID COL_7 299 254 TOP-FREQUENCY COL_8 5 5 FREQUENCY 8 rows selected.
If you want more information about the histogram itself that you have on a specific column, you can further query DBA_HISTOGRAMS view.
Be aware that there will be one row for each bucket of a histogram in the table:
SQL> select column_name, count(*) from DBA_HISTOGRAMS where table_name='A' and owner='DR' group by column_name order by column_name; COLUMN_NAME COUNT(*) --------------- -------- COL_1 254 COL_2 5 COL_3 2 COL_4 2 COL_5 254 COL_6 254 COL_7 254 COL_8 5 8 rows selected.
In the above examples, have you noticed that:
- for frequency histograms, the number of distinct values is equal to the number of buckets.
- for hybrid histograms the number of distinct values is always greater than the number of buckets.
- for top-frequency histograms the number of distinct values is always greater than the number of buckets.
Also have you noticed that the maximum number if buckets is 254? You could have more buckets, however this is the default number of buckets that Oracle is using. If you do not specify explicitly the number of buckets when gathering stats, then by default the number of buckets is 254.
The height balanced histograms are not very common in 12c and up. Just running a quick select statement in my 12.1 environment, will prove that I have 0 histogram of type height balanced:
SQL> select count(*), histogram from DBA_TAB_COL_STATISTICS group by histogram; COUNT(*) HISTOGRAM ---------- --------------- 799 FREQUENCY 71 HYBRID 2 TOP-FREQUENCY 32689 NONE
Now that you have a better understanding about histogram types, let’s explore how will Oracle decide which type of histograms to use.
Oracle is using 4 variables to determine the type of histogram to use. These variables are listed below:
1) number of distinct values in the column -> NDV. If the column in the table has the values red, blue, yellow, green, then NDV=4
2) number of histogram buckets -> n. The default value for n is 254. If you don’t explicitly specify the number of buckets, then by default Oracle will set the number of buckets to 254.
3) internal variable, percentage threshold, p = (1–(1/n)) * 100; This is a formula set by Oracle.
4) the value of estimate_percent, when you gather statistics, whether or not it’s set to default value AUTO_SAMPLE_SIZE, when gathering statistics.
Here are the rules, set by Oracle when picking a specific type of histogram:
Oracle will pick a Frequency Histogram if:
NDV < n (n=254 by default)
if number of distinct values is less than the number of histogram buckets.
Oracle will pick a Height Balanced Histogram if:
NDV > n (n=254 by default)
estimate_percent < > AUTO_SAMPLE_SIZE
if number of distinct values is greater than the number of histogram buckets
and
the estimate percent is not set to AUTO_SAMPLE_SIZE during stats gathering.
Oracle will pick a Hybrid Histogram if:
NDV > n (n=254 by default)
estimate_percent = AUTO_SAMPLE_SIZE
percentage of rows for top n (n=254 by default) frequent values < p
If n=254, then p is 99.6
Oracle will pick a Top Frequency Histogram if
NDV > n (n=254 by default)
estimate_percent = AUTO_SAMPLE_SIZE
percentage of rows for top n (n=254 by default) frequent values >= p
If you enjoyed this article, and would like to learn more about databases, please sign up below, and you will receive
The Ultimate 3 Step Guide To Find The Root Cause Of The Slow Running SQL!
–Diana
Great job explaining the concepts Diana.