How Table Join Order in Hive Affects the Number of Generated MapReduce Jobs: A Case Study (Part 2)

2014-11-24 17:41:27
              d: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        $INTNAME
            Reduce Output Operator
              key expressions:
                    expr: _col3
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col3
                    type: string
              tag: 0
              value expressions:
                    expr: _col4
                    type: int
        e
          TableScan
            alias: e
            Reduce Output Operator
              key expressions:
                    expr: k2
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: k2
                    type: string
              tag: 1
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col10}
            1
          handleSkewJoin: false
          outputColumnNames: _col10
          Select Operator
            expressions:
                  expr: _col10
                  type: int
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1



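On the face of it, this SQL is very simple: table a is the driving table, and its left outer joins to the other tables use only two join keys, k1 and k2, so two MapReduce jobs should be enough. The execution plan for this SQL, however, contains three jobs (Stage-0 only fetches the final result for display and can be ignored): the first job (Stage-5) joins table a with table b; the second job (Stage-1) joins the intermediate result of the first job with tables c, d and f; the third job (Stage-2) joins the intermediate result of the second job with table e.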
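The stage breakdown described above is consistent with a simple rule of thumb: joins that sit next to each other and share a join key are handled in one MapReduce job, and each change of key starts a new job. The Python sketch below is only a toy model of that rule, not Hive's planner. The function name `plan_join_jobs` and both join sequences are hypothetical. Only the key for table e (k2) is visible in the plan on this page; the keys assigned to b, c, d and f are assumptions chosen to match the three stages described.

```python
from itertools import groupby

# Hypothetical join sequences: (table, join key) in the order the tables
# are joined to the driving table a. Only e's key (k2) appears in the plan
# shown on this page; the keys for b, c, d and f are assumptions.
WRITTEN_ORDER = [("b", "k2"), ("c", "k1"), ("d", "k1"), ("f", "k1"), ("e", "k2")]
REORDERED = [("c", "k1"), ("d", "k1"), ("f", "k1"), ("b", "k2"), ("e", "k2")]


def plan_join_jobs(joins):
    """Group consecutive joins that share a key into one simulated job.

    Returns a list of (key, [tables]) pairs, one pair per simulated job.
    This is a simplified model, not Hive's actual planning logic.
    """
    return [
        (key, [table for table, _ in group])
        for key, group in groupby(joins, key=lambda join: join[1])
    ]


if __name__ == "__main__":
    for name, joins in (("written order", WRITTEN_ORDER), ("reordered", REORDERED)):
        jobs = plan_join_jobs(joins)
        print(f"{name}: {len(jobs)} job(s)")
        for key, tables in jobs:
            print(f"  join on {key}: {', '.join(tables)}")
```

Under these assumptions, the sequence as written splits into three groups (b, then c/d/f, then e), mirroring Stage-5, Stage-1 and Stage-2, while placing the joins that share a key next to each other yields two groups. Whether a given Hive version actually produces two jobs after such a reordering should be confirmed with EXPLAIN.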

