Improvements to the Hive Optimizer

e.optimize.bucketmapjoin.sortedmerge = true;

set hive.auto.convert.sortmerge.join.noconditionaltask=true;

There is an option to set the big table selection policy:

set hive.auto.convert.sortmerge.join.bigtable.selection.policy = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

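By default, the selection policy is average partition size. The big table selection policy helps determine which table is only streamed, as opposed to being hashed and streamed.

The available selection policies are: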

org.apache.hadoop.hive.ql.optimizer.AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default)

org.apache.hadoop.hive.ql.optimizer.LeftmostBigTableSelectorForAutoSMJ

org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ

The names describe their uses. These policies are especially useful for fact-to-fact joins.
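
As a rough illustration of a fact-to-fact join that could use these settings, the sketch below defines two hypothetical fact tables (sales_fact and returns_fact, with made-up columns and bucket counts) that are bucketed and sorted on the join key, then selects the table-size-based policy. It assumes both tables use the same bucketing, which a sort-merge-bucket join generally requires, and it is not taken from the original article.

```sql
-- Hypothetical fact tables, bucketed and sorted on the join key.
CREATE TABLE sales_fact (item_id BIGINT, sold_qty INT)
CLUSTERED BY (item_id) SORTED BY (item_id ASC) INTO 32 BUCKETS;

CREATE TABLE returns_fact (item_id BIGINT, returned_qty INT)
CLUSTERED BY (item_id) SORTED BY (item_id ASC) INTO 32 BUCKETS;

-- Allow automatic conversion to a sort-merge join and choose the
-- big (streamed) table by total table size rather than by the default policy.
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.bigtable.selection.policy = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

-- Join on the bucketing/sorting column so the conversion can apply.
SELECT s.item_id, s.sold_qty, r.returned_qty
FROM sales_fact s
JOIN returns_fact r ON s.item_id = r.item_id;
```

Running EXPLAIN on the query is one way to check whether the plan actually uses a sort-merge-bucket join in a given Hive version.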

Generate Hash Tables on the Task Side

Future versions may move hash table generation to the task side (currently the hash tables are generated on the client).

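Pros and Cons of Client-Side Hash Tables

Generating the hash table (or multiple hash tables for multi-table joins) on the client machine has drawbacks. The client machine is the host used to run the Hive client and submit jobs.

Drawbacks: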

• Data locality: The client machine typically is not a data node. All the data accessed is remote and has to be read via the network.

• Specs: For the same reason, it is not clear what the specifications of the machine running this processing will be. It might have limitations in memory, hard drive, or CPU that the task nodes do not have.

• HDFS upload: The data has to be brought back to the cluster and replicated via the distributed cache to be used by task nodes.

Benefits:

• What is stored in the distributed cache is likely to be smaller than the original table, because filters and projections have already been applied when the hash table is built.

• In contrast, loading hash tables directly on the task nodes using the distributed cache means larger objects in the cache, potentially reducing opportunities for using MAPJOIN.
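
To make the filter and projection point concrete, here is a minimal sketch using TPC-DS-style table and column names (store_sales and date_dim are assumptions, not tables mentioned in the article). If the join is converted to a map join, the hash table for date_dim only needs the rows that pass the d_year filter and only the columns the query references, so it can be much smaller than the full dimension table.

```sql
-- Let Hive convert the join to a map join when the small table qualifies.
set hive.auto.convert.join=true;

-- Only rows with d_year = 2013 (filter) and only the d_date_sk and d_moy
-- columns (projection) of date_dim are needed on the hashed side.
SELECT d.d_moy, SUM(s.ss_net_paid) AS net_paid
FROM store_sales s
JOIN date_dim d ON s.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2013
GROUP BY d.d_moy;
```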

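Task-Side Generation of Hash Tables

When the hash table is generated on the task side, all task nodes must access the original data source to build it, so they all read the same resource at the same time. In the normal case this happens in parallel and does not add latency. However, Hive supports storage handlers for external data sources such as HBase or a database, and having many tasks access the same external source simultaneously can slow that source down.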

Further Options for Optimization

1. Increase the replication factor on dimension tables.

2. Use the distributed cache to hold dimension tables.
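
As one possible way to try option 1, the replication factor of a dimension table's files can be raised with the HDFS setrep command, which the Hive CLI can run through its dfs command. The path below assumes a table named date_dim stored under the default warehouse directory, and the factor of 10 is an arbitrary illustrative value. Both are assumptions.

```sql
-- Raise the HDFS replication factor for all files of a hypothetical
-- dimension table so that more nodes hold a local copy.
dfs -setrep 10 /user/hive/warehouse/date_dim;

-- List the files to check the new replication factor (second column).
dfs -ls /user/hive/warehouse/date_dim;
```

Higher replication costs extra disk space, so this is normally worth considering only for small, frequently joined dimension tables.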
