解释直方图信息 - 数据库编程

解释直方图信息

适用于：
Oracle Database - Enterprise Edition - Version 7.3.0.0 and later
Oracle Database - Standard Edition - Version 7.3.0.0 and later
Oracle Database - Personal Edition - Version 7.3.0.0 and later
Information in this document applies to any platform.

目的：
直方图信息是怎么被存储的,是怎么被解释的。

范围：
其他有用的直方图参考：
Document 1445372.1 Histograms: An Overview (10g and Above)

细节：
直方图是一种机制，该机制用来存储列数据（column data）的详细信息。该数据被CBO使用，用来决定一个查询语句最优化的访问路径（access path）.
没有直方图时，优化器依靠的所有信息是：一个列的高值和低值，该列的不同值个数，该列的空值个数，该table的记录总数。
（实际上列的高值和低值是以raw 格式存储的，因此不是特别有用），其他的信息可以从dictionary views中查询到。

没有列的统计信息时，优化器假设数据是均衡分布的，对于等值谓词，生成以一个选择率(column selectivity)，该选择率是如下计算的：1/NVD(Number of Distinct Values)

有直方图时，你可以访问行数据的更多分布信息。

当一个列的数据分布不均衡时（即：列的数据分布高度倾斜--数据分布倾斜的很厉害），Oracle 可以存储列的直方图以给出更好的选择率.这会产生比使用标准的统计信息（high and low values plus Number of Distinct Values）更好的执行计划

就具体实现而言（In terms of implementation）,我们可以选择将每个不同值和该值的记录数存放在一起，对于值很少的记录数是有效的，此时，'width balanced' histograms 被使用。

随着不同值数量的增长，存储数据的数量变得过高，我们需要使用一个不同的方法来存储直方图数据。此时，我们可以选择 height balanced histograms.

使用如上两种方法，列直方图提供了一个有效和集中的方法来展现数据分布。当建立直方图时，存储的信息依靠“不同值的数量是否小于等于bucket（默认75个，最大254个）的数量”进行不同的解释。

如果不同值的数量小于等于直方图bucket的数量（bucket最多254个），那么 Frequency Histogram 被建立
如果不同值的数量大于直方图bucket的数量， Height Balanced Histogram 被建立。

Frequency Histogram
Frequency Histogram 使用bucket来记录每一个不同值的记录个数

Height Balanced Histogram
Height Balanced Histogram 通过把数据分割到不同bucket中来实现。每个bucket 包括相同数量的列值。每个bucket中的最高值（or END_POINT）和最低值被记录在零号bucket中。

一旦数据被存储于bucket中，我们可以识别两个类型的data value--- Non-popular values and popular values

Non-popular values--are those that do not occur multiple times as end points.不会出现多次
Popular values--occur multiple times as end points.会出现多次。

We can use Popular and Non-Popular Values to provide use with various statistics.Since we know how many values there are in a bucket we can use this information to estimate the number of rows in total that are covered by Popular and Non-Popular values.
?The selectivity for popular values can be obtained by calculation the proportion of bucket endpoints filled by that popular value.
?The selectivity for non popular values can now be calculated as 1/number non-popular bucket endpoints, so we can now be more accurate about selectivities than the original 1/NDV, because we have removed the popular values from the equation.

How histograms are used
直方图被用来得到column predicate 更好的selectivity 估算

Where there are fewer distinct values than buckets, the selectivity is simply calculated as we have accurate row information for each value. For the case where we have more distinct values than buckets, the following outlines how these selectivities are obtained.

Equality Predicate Selectivity calculated from:
?Popular Value:
Number of buckets for value / Total Number of buckets
?Non-Popular Value:
Density see:

Document 43041.1 Query Optimizer: What is Density?

Less than < (Same principle applies for > & >= )
?All Values:
Buckets with endpoints < value / Total No. of buckets

Histogram Examples

 Table TAB1

SQL> desc tab1
 Name                            Null?    Type
 ------------------------------- -------- ----
 A                                        NUMBER(6)
 B                                        NUMBER(6)

Column A contains unique values from 1 to 10000.
Column B contains 10 distinct values.

The value '5' occurs 9991 times.
Values '1, 2, 3, 4, 9996, 9997, 9998, 9999, 10000' occur only once.

i.e.

select distinct B , count(*)
from HTAB1
group by B
order by B
;

         B   COUNT(*)
---------- ----------
         1          1
         2          1
         3          1
         4          1
         5       9991
      9996

解释直方图信息(一)

Histogram Examples