Files Hive Leaves on HDFS After Loading Data, and How to Consolidate Them


2018-12-12 00:20:21

Create a table with no data in it; there is no corresponding data file on HDFS either:

hive> select * from product;
OK
id name
Time taken: 0.104 seconds
hive> dfs -ls /user/hive/warehouse/psi.db/product;
hive>

Load a file from the local filesystem into the table:

hive> load data local inpath 'product.txt' into table product;
Copying data from file:/root/product.txt
Copying file: file:/root/product.txt
Loading data to table psi.product
Table psi.product stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]
OK
Time taken: 0.398 seconds

Check the data:
hive> select * from product;
OK
id name
1 apple
2 samsung
3 moto
Time taken: 0.124 seconds, Fetched: 3 row(s)

Check HDFS: the file has simply been copied from the local filesystem to HDFS, byte for byte:
hive> dfs -ls /user/hive/warehouse/psi.db/product;
Found 1 items
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:21 /user/hive/warehouse/psi.db/product/product.txt
hive>

Now load it a second time:

hive> load data local inpath 'product.txt' into table product;
Copying data from file:/root/product.txt
Copying file: file:/root/product.txt
Loading data to table psi.product
Table psi.product stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 50, raw_data_size: 0]
OK
Time taken: 0.346 seconds
hive>

Check HDFS again: the data was not appended to the existing file, as one might expect; instead a new file, product_copy_1.txt, was created:

hive> dfs -ls /user/hive/warehouse/psi.db/product;
Found 2 items
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:21 /user/hive/warehouse/psi.db/product/product.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:23 /user/hive/warehouse/psi.db/product/product_copy_1.txt
hive>
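
If the goal is to replace the table's contents rather than accumulate another file, LOAD DATA also accepts an OVERWRITE keyword. The following is a minimal sketch that reuses the file and table from above. It is not part of the walkthrough: running it would reset the table directory, so the later listings would no longer match.

```sql
-- OVERWRITE removes the table's current files before adding the new one,
-- instead of placing another copy next to them.
LOAD DATA LOCAL INPATH 'product.txt' OVERWRITE INTO TABLE product;

-- List the table directory again to see what is left.
dfs -ls /user/hive/warehouse/psi.db/product;
```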

Now upload the local file to HDFS and load it from there:

hive> dfs -put /root/product.txt /usr;
hive> load data inpath '/usr/product.txt' into table product;
Loading data to table psi.product
Table psi.product stats: [num_partitions: 0, num_files: 3, num_rows: 0, total_size: 75, raw_data_size: 0]
OK
Time taken: 0.337 seconds
hive>

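Check HDFS again: likewise, the file has simply been placed in the table's directory unchanged (for an HDFS source path, Hive moves the file rather than copying it):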

hive> dfs -ls /user/hive/warehouse/psi.db/product;
Found 3 items
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:21 /user/hive/warehouse/psi.db/product/product.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:23 /user/hive/warehouse/psi.db/product/product_copy_1.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:27 /user/hive/warehouse/psi.db/product/product_copy_2.txt
hive>
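
One difference from the LOCAL case is worth checking: when the source is already on HDFS, LOAD DATA moves the file into the table directory instead of copying it. A quick way to confirm this, assuming the same /usr/product.txt path used above, is to list the source location after the load:

```sql
-- After a non-LOCAL load the source file should no longer be at its
-- original HDFS location, because Hive moved it under the table directory.
dfs -ls /usr/product.txt;
```

If the same file is needed again, it has to be uploaded again with dfs -put.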

Now try INSERT INTO.
First, make a copy of the table:

hive> create table product2 as select * from product;

hive> insert into table product select * from product2;

Check HDFS again: there is one more file, 000000_0:

hive> dfs -ls /user/hive/warehouse/psi.db/product;
Found 4 items
-rw-r--r-- 1 root supergroup 75 2013-09-05 02:30 /user/hive/warehouse/psi.db/product/000000_0
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:21 /user/hive/warehouse/psi.db/product/product.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:23 /user/hive/warehouse/psi.db/product/product_copy_1.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:27 /user/hive/warehouse/psi.db/product/product_copy_2.txt
hive>

Run it once more:
hive> insert into table product select * from product2;

Check HDFS again: yet another new file has appeared, 000000_0_copy_1:

hive> dfs -ls /user/hive/warehouse/psi.db/product;
Found 5 items
-rw-r--r-- 1 root supergroup 75 2013-09-05 02:30 /user/hive/warehouse/psi.db/product/000000_0
-rw-r--r-- 1 root supergroup 75 2013-09-05 02:31 /user/hive/warehouse/psi.db/product/000000_0_copy_1
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:21 /user/hive/warehouse/psi.db/product/product.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:23 /user/hive/warehouse/psi.db/product/product_copy_1.txt
-rw-r--r-- 1 root supergroup 25 2013-09-05 02:27 /user/hive/warehouse/psi.db/product/product_copy_2.txt
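
As the listing grows with every load or insert, it can be easier to track the totals than to read every entry. A small sketch using the standard HDFS count command from the Hive prompt, with the same warehouse path as above:

```sql
-- Prints four columns for the path:
-- directory count, file count, total content size in bytes, path name.
dfs -count /user/hive/warehouse/psi.db/product;
```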

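This means that every time Hive loads a file or inserts data, it creates a new file on HDFS and simply points the table at those files. After ETL jobs have been running for a long time, the number of files becomes considerable, and the pressure on the NameNode is just as obvious. So how can these files be periodically consolidated into one? One way is to create a new table: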

hive> create table temp as select * from product;

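Look at how this table is stored on HDFS: Hive did not copy each of the files under the product table; it merged their contents into one 225-byte file (3 × 25 + 2 × 75 bytes), which makes things much easier: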

hive> dfs -ls /user/hive/warehouse/psi.db/temp;
Found 1 items
-rw-r--r-- 1 root supergroup 225 2013-09-05 02:36 /user/hive/warehouse/psi.db/temp/000000_0
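
Since the next step empties the original table, it may be worth confirming first that the copy is complete. A simple sanity check, using only the two tables from this walkthrough, is to compare row counts:

```sql
-- Both queries should return the same number of rows
-- before product is truncated.
SELECT COUNT(*) FROM product;
SELECT COUNT(*) FROM temp;
```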

Delete all data from product:

hive> truncate table product;
OK
Time taken: 1.24 seconds
hive> dfs -ls /user/hive/warehouse/psi.db/product;
hive>

The data files are gone from HDFS. Now insert the data from temp back into product, then drop temp:

hive> insert into table product select * from temp;
hive> drop table temp;
OK
Time taken: 0.96 seconds
hive>

hive> dfs -ls /user/hive/warehouse/psi.db/product;
Found 1 items
-rw-r--r-- 1 root supergroup 225 2013-09-05 02:41 /user/hive/warehouse/psi.db/product/000000_0
hive>
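
The round trip through temp can often be shortened to a single statement, by having INSERT OVERWRITE read from and write to the same table. The sketch below assumes a plain, non-transactional text table like product. The hive.merge.* properties are real Hive settings, but the threshold value is only illustrative, and whether the result is exactly one file depends on the number of tasks and on these settings.

```sql
-- Ask Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Illustrative threshold: merge when the average output file
-- is smaller than about 16 MB.
SET hive.merge.smallfiles.avgsize=16000000;

-- Rewrite the table from its own contents.
INSERT OVERWRITE TABLE product SELECT * FROM product;

-- Check how many files remain.
dfs -ls /user/hive/warehouse/psi.db/product;
```

The temp-table approach keeps a second copy of the data until the final drop, which is safer if the rewrite fails partway; the one-statement form trades that safety margin for convenience.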

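This achieves the consolidation. If the original table is partitioned and the partitions are clean, for example partitioned by year so that 2012 data never ends up in the 2013 files, then inserting from temp back into the original table can simply use dynamic partitioning. But if, for some other reason, the data in the original table's partitions is already mixed up, this approach cannot reproduce that same mixed-up layout.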
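
For the partitioned case, the insert back from the temporary table might look like the sketch below. The table and column names (sales, sales_temp, yr) are hypothetical and do not appear in the walkthrough; sales is assumed to be partitioned by yr, and sales_temp is assumed to be an unpartitioned copy that carries yr as an ordinary column.

```sql
-- Allow Hive to derive partition values from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column (yr) must come last in the SELECT list.
-- Each distinct yr value is written to its own partition directory.
INSERT OVERWRITE TABLE sales PARTITION (yr)
SELECT id, name, yr FROM sales_temp;
```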

Of course, Hadoop archives are also a good option.
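
Hive exposes Hadoop archives through ALTER TABLE ... ARCHIVE PARTITION, which applies to individual partitions of a partitioned table, so it would not work on the unpartitioned product table used here. A minimal sketch, again using the hypothetical sales table partitioned by yr:

```sql
-- Archiving is disabled by default and must be switched on first.
SET hive.archive.enabled=true;

-- Pack the files of one partition into a Hadoop archive (HAR).
ALTER TABLE sales ARCHIVE PARTITION (yr=2012);

-- Restore the partition to ordinary files when it needs to be rewritten.
ALTER TABLE sales UNARCHIVE PARTITION (yr=2012);
```

Archiving reduces the number of files the NameNode has to track, but queries against an archived partition may be slower, so it tends to suit older partitions that are read rarely.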

