A Detailed Walkthrough of HBase and Hive Data Integration - Hive

Hive and HBase integration: theory

1. Why integrate Hive with HBase

2. Advantages and disadvantages of integration

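Advantages:

(1) Hive provides the HiveQL interface, which simplifies the use of MapReduce, while HBase provides low-latency database access. Combining the two makes it possible to use MapReduce for offline computation and analysis over the large volumes of data stored in HBase.

(2) Convenient operation: Hive provides a large amount of built-in functionality.

Disadvantages:

Performance loss. Hive lets you operate on HBase data with SQL-like statements, but these queries are slow.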

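3. What preparation the integration requires

4. Goals after integration

(1) A table created in Hive is created and stored directly in HBase.

(2) Data inserted into the Hive table is written through to the corresponding HBase table.

(3) Changes to column family values in HBase are also visible in the corresponding Hive table.

(4) Multiple columns and multiple column families can be mapped (example: 3 Hive columns map to 2 HBase column families).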

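5. How do Hive and HBase communicate after integration?

Hive and HBase communication diagram (figure not reproduced here).

Communication between Hive and HBase is implemented mainly by hive-hbase-handler-1.2.1.jar in Hive's lib directory.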

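Integration procedure (worked example)

Goal: the data of a table created in Hive is stored directly in HBase.

First: start Hive, enter the interactive shell, and create the table.

Hive version: apache-hive-1.2.1

HBase version: apache-hbase-1.1.2

Hadoop version: hadoop-2.7.3

Create a table that HBase can recognize.

CREATE TABLE statement: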

create table if not exists hive_hbase(
  id int,
  name String,
  age int,
  sex String,
  address String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf_info:eName,cf_info:eAge,cf_info:eSex,cf_beizhu:eAddress")
TBLPROPERTIES ("hbase.table.name" = "ns2:hive_hbase01");
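To make the column mapping easier to read, here is a minimal Python sketch of how the entries in hbase.columns.mapping line up, by position, with the Hive columns declared above. The function name pair_columns is hypothetical and only illustrates the pairing; it is not how Hive parses the property internally.

```python
# Illustrative only: pair_columns is a hypothetical helper, not part of Hive.
HIVE_COLUMNS = ["id", "name", "age", "sex", "address"]
MAPPING = ":key,cf_info:eName,cf_info:eAge,cf_info:eSex,cf_beizhu:eAddress"


def pair_columns(hive_columns, mapping_spec):
    """Pair each Hive column with its HBase target by position."""
    entries = [entry.strip() for entry in mapping_spec.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping entries and Hive columns differ in count")
    pairs = []
    for column, entry in zip(hive_columns, entries):
        if entry == ":key":
            # The special entry :key binds the column to the HBase row key.
            pairs.append((column, "row key", None))
        else:
            family, _, qualifier = entry.partition(":")
            pairs.append((column, family, qualifier))
    return pairs


PAIRS = pair_columns(HIVE_COLUMNS, MAPPING)
for column, family, qualifier in PAIRS:
    print(column, "->", family, qualifier)
```

Read this way, id becomes the row key, name, age and sex land in column family cf_info, and address lands in cf_beizhu.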

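Note: the class org.apache.hadoop.hive.hbase.HBaseStorageHandler comes from a jar in Hive's lib directory, and that jar must be the one built for hive-1.2.1. Otherwise the statement fails because the expected class or method cannot be found.

Error message: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V

The Hive version must not be too new either. For example, a 2.x version reports:

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Also make sure the mysql-connector JDBC jar is present in Hive's lib directory; otherwise an error is reported as well.

After the table is created, you can check for it in the HBase shell with the list command.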

Second: prepare your own test data (omitted here) and create a plain Hive table to hold it.

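-- Columns match hive_hbase so that "select *" lines up with it
create table test(
  id int,
  name string,
  age int,
  sex string,
  address string)
row format delimited fields terminated by ','
lines terminated by '\n'
stored as textfile;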

Load the data into the table:

load data local inpath '/usr/local/test01.txt' overwrite into table test;

Insert into the HBase-backed table from a query result:

insert overwrite table hive_hbase select * from test;

This runs a MapReduce job; its output is omitted here.
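As a rough picture of what the insert produces on the HBase side, the following Python sketch turns one input row into a row key plus a set of family:qualifier cells, using the mapping from the hive_hbase table. The helper row_to_cells is hypothetical, and storing every value as a string is an assumption made for illustration rather than a statement about the exact bytes Hive writes.

```python
# Illustrative only: row_to_cells is a hypothetical helper, not a Hive API.
MAPPING = ":key,cf_info:eName,cf_info:eAge,cf_info:eSex,cf_beizhu:eAddress"


def row_to_cells(row, mapping_spec=MAPPING):
    """Split one row into (row_key, {"family:qualifier": value})."""
    entries = mapping_spec.split(",")
    if len(row) != len(entries):
        raise ValueError("row width does not match the column mapping")
    row_key = None
    cells = {}
    for value, entry in zip(row, entries):
        if entry == ":key":
            row_key = str(value)  # assumption: values are kept as strings
        else:
            cells[entry] = str(value)
    return row_key, cells


# One line of the test data, already split on ','.
row_key, cells = row_to_cells((20170616, "zhangshaoqi", 22, "nan", "jincheng"))
print(row_key)
for column, value in cells.items():
    print(" ", column, "=", value)
```

A source table with fewer columns than the mapping has entries would fail the width check here, which mirrors why the source table needs the same five columns as hive_hbase.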

Third: query the inserted data from Hive

select * from hive_hbase;

20170616,zhangshaoqi,22,nan,jincheng

20170617,xuqianya,29,nv,beijing

20170618,xiaolin,29,nv,jincheng

20170619,xiaopan,33,nan,guizhou

20170620,xiaohu,26,nan,shouzhou

1 row(s) in 3.19 seconds
Fourth: scan the table in the HBase shell to check that the data is there

scan 'ns2:hive_hbase01'

Fifth: access an existing HBase table from Hive

The Hive table must be declared EXTERNAL; otherwise an error is reported.

CREATE EXTERNAL TABLE hbase_table_3(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")
TBLPROPERTIES("hbase.table.name" = "student");
hive> CREATE EXTERNAL TABLE hbase_table_3(key int, value string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")
> TBLPROPERTIES("hbase.table.name" = "student");
OK
Time taken: 1.21 seconds

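Note: if the data in the mapped HBase column (name in column family info) changes, Hive query results change accordingly. Updates to other columns or column families that are not mapped do not show up in Hive query results.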
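The visibility rule in this note can be sketched in a few lines of Python. The function hive_view and the sample row values are hypothetical. The sketch assumes, as the successful CREATE above suggests, that the Hive column key is bound to the HBase row key and value to info:name.

```python
# Illustrative only: hive_view and the sample data are hypothetical.
def hive_view(row_key, hbase_row, mapped_column="info:name"):
    """Return what a Hive query on hbase_table_3 would expose for one row."""
    return {"key": int(row_key), "value": hbase_row.get(mapped_column)}


before = {"info:name": "tom", "info:age": "20"}

# Updating the mapped column changes the Hive result.
after_mapped = {**before, "info:name": "jerry"}

# Updating a column that is not in the mapping leaves the Hive result as is.
after_unmapped = {**before, "info:age": "21"}

print(hive_view("1001", before))
print(hive_view("1001", after_mapped))
print(hive_view("1001", after_unmapped))
```

Only the first update is observable through hbase_table_3, because info:age is not part of its column mapping.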

That is all. Questions and discussion are welcome.

Reposted from ChinaUnicom110's 51CTO blog. Original link: http://blog.51cto.com/xingyue2011/1939096