Case study: how table join order in Hive affects the number of MapReduce jobs generated

              d: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        $INTNAME
            Reduce Output Operator
              key expressions:
                    expr: _col3
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col3
                    type: string
              tag: 0
              value expressions:
                    expr: _col4
                    type: int
          TableScan
            alias: e
            Reduce Output Operator
              key expressions:
                    expr: k2
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: k2
                    type: string
              tag: 1
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col10}
          handleSkewJoin: false
          outputColumnNames: _col10
          Select Operator
            expressions:
                  expr: _col10
                  type: int
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

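On the face of it, this SQL is very simple: table a is the driving table, and its left outer joins to the other tables use only two join keys, k1 and k2, so two MapReduce jobs should be enough. The execution plan, however, produces three jobs (Stage-0 only fetches the final result for display and can be ignored). The first job (Stage-5) joins table a with table b. The second job (Stage-1) joins the intermediate result of the first job with tables c, d, and f. The third job (Stage-2) joins the intermediate result of the second job with table e; in the plan, this is the step where $INTNAME (tag 0, key _col3) is joined against e on k2 (tag 1).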
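The job count described above can be reproduced with a deliberately simplified model: assume Hive merges adjacent joins that share the same join key into one MapReduce job and starts a new job whenever the key changes. The sketch below is written in Python only so that it can be run. It is not how Hive's planner is implemented, and the key assigned to each table is an assumption inferred from the plan (only e's use of k2 is visible in this fragment). The names `count_join_jobs`, `original_order`, and `reordered` are hypothetical.

```python
from itertools import groupby


def count_join_jobs(join_order):
    """Count MapReduce jobs under a simplified model.

    join_order is a list of (table, join_key) pairs in the order the
    tables are joined to the driving table. Adjacent joins on the same
    key are assumed to share one job; a key change starts a new job.
    """
    keys = [key for _table, key in join_order]
    # groupby yields one group per run of consecutive equal keys.
    return sum(1 for _key, _run in groupby(keys))


# Assumed key assignment: b and e join on k2; c, d, and f join on k1.
original_order = [("b", "k2"), ("c", "k1"), ("d", "k1"), ("f", "k1"), ("e", "k2")]

# Same joins, with the two k2 joins placed next to each other.
reordered = [("b", "k2"), ("e", "k2"), ("c", "k1"), ("d", "k1"), ("f", "k1")]

print(count_join_jobs(original_order))  # 3
print(count_join_jobs(reordered))       # 2
```

Under these assumptions, the original order breaks the k2 joins into two separate runs (b, then e), which matches the three stages in the plan, while grouping the joins by key gives the expected two jobs.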
