A Study of Hive's Sorting Features (Part 10)
2014-11-24 07:25:19
Tags: hive, sorting, features, study
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.34 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 340 msec
OK
Time taken: 17.637 seconds
Query results after writing to the files:
Result analysis: a hash algorithm is used; rows are written to different reduce output files according to their hash values.
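The hash-based distribution described above can be sketched in a few lines of Python. This is a hypothetical simulation, not Hive's implementation: `crc32` stands in for the Java `hashCode` the real partitioner uses, and the sample rows are invented.

```python
import zlib

def assign_reducer(key, num_reducers):
    # Hive's default partitioner computes hash(key) mod numReduceTasks;
    # crc32 is a deterministic stand-in for Java's hashCode here.
    return zlib.crc32(str(key).encode()) % num_reducers

# Invented sample rows: (id, devid, job_time).
rows = [(1, "dev1", "030729"), (2, "dev2", "030729"), (3, "dev3", "120000")]
num_reducers = 2  # matches mapred.reduce.tasks=2 above

files = {i: [] for i in range(num_reducers)}
for row in rows:
    # Partition on the distribute-by column (job_time).
    files[assign_reducer(row[2], num_reducers)].append(row)
# All rows sharing a job_time value land in the same reduce file.
```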
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/tmp/hivetest/clusterby' select id,devid,job_time from tb_in_base where job_time=030729 cluster by job_time;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201307151509_15533, Tracking URL = http://mwtec-50:50030/jobdetails.jsp jobid=job_201307151509_15533
Kill Command = /home/hadoop/hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=mwtec-50:9002 -kill job_201307151509_15533
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2013-08-05 19:43:38,722 Stage-1 map = 0%, reduce = 0%
2013-08-05 19:43:40,732 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:41,738 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:42,743 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:43,748 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:44,754 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:45,759 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:46,765 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.5 sec
2013-08-05 19:43:47,770 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 1.5 sec
2013-08-05 19:43:48,776 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 3.02 sec
2013-08-05 19:43:49,781 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.55 sec
2013-08-05 19:43:50,787 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_201307151509_15533
Copying data to local directory /tmp/hivetest/clusterby
Copying data to local directory /tmp/hivetest/clusterby
7 Rows loaded to /tmp/hivetest/clusterby
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 2 Cumulative CPU: 4.55 sec HDFS Read: 458 HDFS Write: 112 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
Time taken: 16.613 seconds
Query results after writing to the files:
Result explanation: cluster by hashes the specified column; rows with equal hash values are assigned to the same reduce output file.
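As a rough illustration (a hypothetical Python sketch, not Hive internals), `cluster by col` behaves like `distribute by col` followed by `sort by col`: hash-partition the rows, then sort within each partition. Again `crc32` stands in for the real partitioner's hash, and the sample rows are invented.

```python
import zlib

def cluster_by(rows, key_fn, num_reducers=2):
    # distribute by: same key -> same hash -> same reduce file.
    parts = {i: [] for i in range(num_reducers)}
    for row in rows:
        parts[zlib.crc32(str(key_fn(row)).encode()) % num_reducers].append(row)
    # sort by: each reduce file is sorted on the key, but there is
    # no total order across files.
    return {i: sorted(p, key=key_fn) for i, p in parts.items()}

# Invented sample rows: (id, job_time).
out = cluster_by([(3, "120000"), (1, "030729"), (2, "030729")],
                 key_fn=lambda r: r[1])
```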
3. Summary and Analysis
1). order by uses a single reducer to sort all of the data, which takes a long time on a large data set. order by is therefore best suited to small data sets.
2). order by is affected by the hive.mapred.mode setting: in strict mode, an order by query must specify a limit (and, on a partitioned table, which partition); in nonstrict mode it behaves much like a relational database.
3). sort by is essentially unaffected by hive.mapred.mode; the number of reducers can be set via mapred.reduce.tasks, and the query results are distributed across those reducers.
4). sort by completes its sorting before the data enters each reducer. With mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is sorted; it does not guarantee a global order.
5). distribute by uses a hash algorithm: on the map side, result rows with the same hash value are sent to the same reduce output file.
6). distribute by can use the length function to partition string values into different reducers by their length, and thus into different output files. length is a built-in function; other functions, including user-defined functions, can be used instead.
7). cluster by has all of distribute by's functionality and additionally sorts on that column.
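Point 6 can be illustrated with a small sketch (hypothetical Python with invented sample strings; the Hive equivalent would be something like `distribute by length(devid)`):

```python
def distribute_by_length(values, num_reducers=2):
    # Partition strings by length(value) mod numReduceTasks, mimicking
    # Hive's `distribute by length(col)`: equal-length strings always
    # land in the same reduce file.
    files = {i: [] for i in range(num_reducers)}
    for v in values:
        files[len(v) % num_reducers].append(v)
    return files

# Invented sample values; lengths 2 go to file 0, lengths 1 and 3 to file 1.
out = distribute_by_length(["a", "bb", "ccc", "dd"])
```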