线上一个查询简化如下:
Selectdt,count(distinct c1) , count(distinct case when c2>0 and c1=0 then c1 end),count(distinct case when c2>0 and c1>0 then c1 end) from t where dtbetween ‘20131108’ and ‘20131110’ group by dt;
一个让人头痛的multi-distinct问题,为什么说很头痛,看看执行计划就清楚了:
查看执行计划(省去非关键部分):
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages:Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 is a root stage
可以看到每个single-distinct都是独立的stage,因此可以设置上面的参数,这里既然每个stage是独立的那么是不是可以设置hive.exec.parallel,hive.exec.parallel.thread.number这两个参数来以资源换时间呢?故事总是残忍的,这里还有一个Bug(https://issues.apache.org/jira/browse/HIVE-4436), 因此在Hive0.12 release前是没办法的,这就叫有钱没地方花.
另外也有通过unionall+sum的解决方法,感兴趣的同学也可以尝试一下.
相关阅读: