A Study of Hive Sorting Behavior (Part 6)
2014-11-24 07:25:19
Tags: Hive, sorting, behavior, study
Result: with sort by, the number of reducers can be set via mapred.reduce.tasks, so the query output is written into two separate reduce output files.
--- set mapred.reduce.tasks=2 and use order by
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/orderby' select id,devid,job_time from tb_in_base where job_time=030729 order by devid;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1 // Note: only one reducer is used
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201307151509_15469, Tracking URL = http://mwtec-50:50030/jobdetails.jsp?jobid=job_201307151509_15469
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15469
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-08-05 18:14:22,528 Stage-1 map = 0%, reduce = 0%
2013-08-05 18:14:24,538 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:25,548 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:26,562 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:27,568 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:28,574 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:29,581 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:30,587 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
2013-08-05 18:14:31,592 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 1.05 sec
2013-08-05 18:14:32,598 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.56 sec
2013-08-05 18:14:33,604 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.56 sec
2013-08-05 18:14:34,611 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.56 sec
MapReduce Total cumulative CPU time: 2 seconds 560 msec
Ended Job = job_201307151509_15469
Copying data to local directory /tmp/hivetest/orderby
Copying data to local directory /tmp/hivetest/orderby
7 Rows loaded to /tmp/hivetest/orderby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.56 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 560 msec
OK
Time taken: 16.804 seconds
View the query results under /tmp/hivetest/orderby.
Result: with mapred.reduce.tasks=2 set, order by still writes all rows into a single file. This shows that no matter how many reduce tasks are configured, order by uses only one reducer.
Note: a limit clause can greatly cut the data volume. With limit n, the number of records shipped to the reducer drops to n * (number of map tasks); without it, a very large data set may never produce a result.
Question: limit n does reduce the data, but for computing statistics, is a result over incomplete data still meaningful?
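To make the contrast above concrete, here is a minimal Python sketch (not Hive code; the rows and the round-robin split are made up for illustration) of why order by yields one globally sorted file while sort by with two reducers yields two files that are only sorted internally:

```python
# Sample (devid, id) rows; values are hypothetical.
rows = [("d3", 7), ("d1", 3), ("d2", 5), ("d1", 1), ("d3", 2), ("d2", 9)]

# order by: a single reducer receives every row and emits one sorted file.
order_by_output = sorted(rows)

# sort by with 2 reducers: rows are split among reducers first (round-robin
# here, purely for illustration), then each reducer sorts only its own slice.
partitions = [rows[0::2], rows[1::2]]
sort_by_outputs = [sorted(p) for p in partitions]

print(order_by_output)        # one file, globally ordered
for f in sort_by_outputs:     # two files, each ordered only within itself
    print(f)
```

Concatenating the two sort by files does not generally give a globally sorted result, which is exactly why order by cannot be parallelized across reducers.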
2.3 distribute by
distribute by controls how map output is split among the reducers. Rows are distributed according to the distribute by column(s) and the number of reducers, using a hash algorithm by default. distribute by can also use the built-in length function to route string values to different reducers by their length, so they end up in different output files; other built-in functions or user-defined functions can be used as well.
Note: when testing distribute by, be sure to configure more than one reducer; otherwise its effect cannot be observed.
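The default routing described above can be sketched in Python as hash(key) % num_reducers (Hive's actual hash function differs; Python's hash() and the sample values below are stand-ins for illustration):

```python
num_reducers = 2  # mirrors set mapred.reduce.tasks=2

def route(key):
    # Default distribute by behavior: hash the key, take modulo reducer count.
    return hash(key) % num_reducers

ids = [1, 2, 3, 4, 5, 6, 7]
buckets = {r: [i for i in ids if route(i) == r] for r in range(num_reducers)}
print(buckets)  # each id lands in exactly one reducer's bucket

# distribute by length(devid): route strings by their length instead,
# mirroring the length() example in the text.
devids = ["a", "bb", "ccc", "dd"]
by_length = {d: len(d) % num_reducers for d in devids}
print(by_length)
```

The key point is that every row with the same key (or the same length, in the second variant) is guaranteed to reach the same reducer, and therefore the same output file.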
-- distribute by id, with set mapred.reduce.tasks=2
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/distributeby' select id,devid,job_time from tb_in_base where job_time=030729 and id<8 distribute by id ;
Total MapReduce jobs = 1
Launching Job 1 out of 1