hiveSQL优化之distributeby和sortby(一)

2014-11-24 14:47:40 · 作者: · 浏览: 0
最近在优化hiveSQL,

下面是一段排序,分组后取每组第一行记录的SQL

  1. INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}')
  2. SELECT
  3. bussiness_id,
  4. cookie_id,
  5. session_id,
  6. funnel_id,
  7. group_first(funnel_name) funnel_name,
  8. step_id,
  9. group_first(step_name) step_name,
  10. group_first(log_type) log_type,
  11. group_first(url_pattern) url_pattern,
  12. group_first(url) url,
  13. group_first(refer) refer,
  14. group_first(log_time) log_time,
  15. group_first(is_new_visitor) is_new_visitor,
  16. group_first(is_mobile_traffic) is_mobile_traffic,
  17. group_first(is_bounce) is_bounce,
  18. group_first(campaign_name) campaign_name,
  19. group_first(group_name) group_name,
  20. group_first(slot_name) slot_name,
  21. group_first(source_type) source_type,
  22. group_first(next_page) next_page,
  23. group_first(continent) continent,
  24. group_first(sub_continent_region) sub_continent_region,
  25. group_first(country) country,
  26. group_first(region) region,
  27. group_first(city) city,
  28. group_first(language) language,
  29. group_first(browser) browser,
  30. group_first(os) os,
  31. group_first(screen_color) screen_color,
  32. group_first(screen_resolution) screen_resolution,
  33. group_first(flash_version) flash_version,
  34. group_first(java) java,
  35. group_first(host) host
  36. FROM
  37. ( SELECT *
  38. FROM r_wa_funnel
  39. WHERE pt='${SRCTIME}'
  40. ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC
  41. ) t1
  42. GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;

    group_first: 自定义函数,用户取每组第一个字段
    ${SRCTIME}: 由外部oozie调度传入, 作为时间分区,精确到小时.eg: 2011.11.01.21

    下面在hive上以SRCTIME = 2011.11.01.21 执行以上SQL. 2011.11.01.21小时分区记录数有10435486

    执行时间:

    从上面可以看出,reduce阶段只有一个reduce, 这是因为ORDER BY是全局排序,hive只能通过一个reduce进行排序
    从业务需求来看, 只要按bussiness_id, cookie_id, session_id, funnel_id, step_id分组,组内按
    log_time升序排序即可.

    OK, 这样可以采用hive提供的distribute by 和 sort by,这样可以充分利用hadoop资源, 在多个
    reduce中局部按log_time 排序

    优化有的hive代码:

    1. INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21')
    2. SELECT
    3. bussiness_id,
    4. cookie_id,
    5. session_id,
    6. funnel_id,
    7. group_first(funnel_name) funnel_name,
    8. step_id,
    9. group_first(step_name) step_name,
    10. group_first(log_type) log_type,
    11. group_first(url_pattern) url_pattern,
    12. group_first(url) url,
    13. group_first(refer) refer,
    14. group_first(log_time) log_time,
    15. group_first(is_new_visitor) is_new_visitor,
    16. group_first(is_mobile_traffic) is_mobile_traffic,
    17. group_first(is_bounce) is_bounce,
    18. group_first(campaign_name) campaign_name,
    19. group_first(group_name) group_name,
    20. group_first(slot_name) slot_name,
    21. group_first(source_type) source_type,
    22. group_first(next_page) next_page,
    23. group_first(continent) continent,
    24. group_first(sub_continent_region) sub_continent_region,
    25. group_first(country) country,
    26. group_first(region) region,
    27. group_first(city) city,
    28. group_first(language) language,
    29. group_first(browser) browser,
    30. group_first(os) os,
    31. group_first(screen_color) screen_color,
    32. group_first(screen_resolution) screen_resolution,
    33. group_first(flash_version) flash_version,
    34. group_first(java) java,
    35. group_first(host) host
    36. FROM
    37. ( SELECT *
    38. FROM r_wa_funnel
    39. WHERE pt='2011.11.01.21'
    40. distribute by bussiness_id, cookie_id, session_id, funnel_id, step_id sort by log_time ASC
    41. ) t1
    42. GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id;

      执行时间:

      第一个需要执行6:43, 而优化有只要执行0:35秒,性能得到大幅