大数据学习16：Hive环境搭建及内部表、分区表 - Hive

TOP

大数据学习16：Hive环境搭建及内部表、分区表

2019-04-15 00:54:47 【大中小】浏览:55次

大数据分析引擎：Hive

大数据的终极目标：使用SQL语句处理大数据
1、Hadoop的终极目标：使用SQL语句来处理大数据
（）Hive：支持SQL；
（）Pig：支持PigLatin
2、Spark的体系架构中：
（*）Spark SQL：类似Hive
支持SQL、支持DSL语句
3、另一个impala

一、什么是Hive

	1、Hive是基于HDFS之上的一个数据仓库
				Hive					HDFS
				表						目录
				数据					文件
				分区					目录
				桶表					文件	
				
	2、Hive基于Hadoop之上的一个数据分析引擎
				Hive是一个翻译器
				Hive 2.x以前：SQL---->Hive------>Map Reduce
				Hive2.x以后：推荐使用Spark作为SQL的执行引擎（只针对Hadoop 3.x以前）
				推荐《Hive on Spark文档》
				
	3.Hive支持SQL的一个子集（HQL）

二、Hive体系架构

三、安装配置Hive

			tar -zxvf apache-hive-2.3.0-bin.tar.gz -C ~/training/
			设置环境变量 
			HIVE_HOME=/root/training/apache-hive-2.3.0-bin
			export HIVE_HOME
	
			PATH=$HIVE_HOME/bin:$PATH
			export PATH
	核心配置文件: conf/hive-site.xml

1、嵌入模式

	(*)不需要MySQL的支持，使用Hive的自带的数据库Derby
	(*)局限：只支持一个连接
	
		<xml version="1.0" encoding="UTF-8" standalone="no">
		<xml-stylesheet type="text/xsl" href="configuration.xsl">
		<configuration>
		<property>
		   <name>javax.jdo.option.ConnectionURL</name>
		   <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
		</property>

		<property>
		   <name>javax.jdo.option.ConnectionDriverName</name>
		   <value>org.apache.derby.jdbc.EmbeddedDriver</value>
		</property>

		<property>
		   <name>hive.metastore.local</name>
		   <value>true</value>
		</property>

		<property>
		   <name>hive.metastore.warehouse.dir</name>
		   <value>file:///root/training/apache-hive-2.3.0-bin/warehouse</value>
		</property>

		</configuration>
		
		初始化Derby数据库
		schematool -dbType derby -initSchema
		
		日志
		Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

2、本地模式、远程模式：需要MySQL

	（*）MySQL的客户端： mysql front  http://www.mysqlfront.de/
	（*）重新创建hive-site.xml
			<xml version="1.0" encoding="UTF-8" standalone="no">
			<xml-stylesheet type="text/xsl" href="configuration.xsl">
			<configuration>
			<property>
			   <name>javax.jdo.option.ConnectionURL</name>
			   <value>jdbc:mysql://localhost:3306/hiveuseSSL=false</value>
			</property>

			<property>
			   <name>javax.jdo.option.ConnectionDriverName</name>
			   <value>com.mysql.jdbc.Driver</value>
			</property>

			<property>
			   <name>javax.jdo.option.ConnectionUserName</name>
			   <value>hiveowner</value>
			</property>

			<property>
			   <name>javax.jdo.option.ConnectionPassword</name>
			   <value>Welcome_1</value>
			</property>

			</configuration>
			
	（*）将mysql的jar包放到lib目录下
	      注意一定要使用高版本的MySQL驱动（5.1.43以上的版本）
（*）初始化MySQL	
			（*）老版本：当第一次启动HIve的时候 自动进行初始化
			（*）新版本：schematool -dbType mysql -initSchema
					Starting metastore schema initialization to 2.3.0
					Initialization script hive-schema-2.3.0.mysql.sql
					Initialization script completed
					schemaTool completed

四、Hive数据模型

测试数据：员工表和部门表
7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30

1、内部表：相当于MySQL的表  对应的HDFS的目录  /user/hive/warehouse
		create table emp
		(empno int,
		ename string,
		job string,
		mgr int,
		hiredate string,
		sal int,
		comm int,
		deptno int);
		
	插入数据 insert、load语句（从某个HDFS的目录或者本地Linux的目录上，把数据导入Hive的表 本质ctrl+x）
	load data inpath '/scott/emp.csv' into table emp;    导入HDFS的数据
	load data local inpath '/root/temp/*****' into table emp;    导入本地Linux的数据
	
	客户端的静默模式：不打印日志（hive -S）(Silent)
	
	创建表的时候，一定指定分隔符
		create table emp1
		(empno int,
		ename string,
		job string,
		mgr int,
		hiredate string,
		sal int,
		comm int,
		deptno int)
		row format delimited fields terminated by ',';		

	创建部门表  并且导入数据
		create table dept
		(deptno int,
		dname string,
		loc string)
		row format delimited fields terminated by ',';	

2、分区表: 可以提高查询的效率的----> 通过查看SQL的执行计划
	*根据员工的部门号创建分区*  partitioned by (deptno int)
		create table emp_part
		(empno int,
		ename string,
		job string,
		mgr int,
		hiredate string,
		sal int,
		comm int)
		partitioned by (deptno int)
		row format delimited fields terminated by ',';

Oracle中SQL的执行计划

查询10号部门的员工
explain plan from select * from emp where deptno-10；
select * from table(dbms)

Hive执行计划

SQL的执行计划

以Oracle数据库为例

补充点知识：Oracle数据库的优化器有两种：RBO、CBO  
                                        R是Rule  规则  
                                        C就是cost 成本

举例：查询10号部门的员工信息
explain plan for select * from emp where deptno=10;   -----> 生成该语句的执行计划
select * from table(dbms_xplan.display);

情况1：没有索引
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     3 |   261 |     3   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| EMP  |     3 |   261 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------

情况2：有索引 create index myindex on emp(deptno
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     3 |   261 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| EMP     |     3 |   261 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | MYINDEX |     3 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Hive的执行计划

1、查询普通的内部表
    explain select * from emp1 where deptno=10;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp1
          Statistics: Num rows: 1 Data size: 617 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (deptno = 10) (type: boolean)
            Statistics: Num rows: 1 Data size: 617 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: int), comm (type: int), 10 (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 1 Data size: 617 Basic stats: COMPLETE Column stats: NONE
              ListSink

2、查询分区表
   explain select * from emp_part where deptno=10;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp_part
          Statistics: Num rows: 3 Data size: 121 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: int), comm (type: int), 10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 3 Data size: 121 Basic stats: COMPLETE Column stats: NONE
            ListSink

查询10号部门员工

	情况一：查询内部表
	select * from empl where deptno=10;
	
	情况二：查询分区表


	指明导入的数据的分区（通过子查询导入数据） ----> MapReduce程序
		insert into table emp_part partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=10;
		insert into table emp_part partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=20;
		insert into table emp_part partition(deptno=30) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=30;

3、外部表
4、桶表
5、视图：view 虚表
(1) 视图不存数据视图依赖的表叫基表
(2) 操作视图跟操作表一样
(3) 视图可以提高查询的效率吗？
不可以、视图是简化复杂的查询

	(4) 举例  查询员工信息：部门名称 员工姓名
			create view myview
			as	
			select dept.dname,emp1.ename
			from emp1,dept
			where emp1.deptno=dept.deptno;
			
	(5) 补充：如果视图可以缓存数据 ------> 物化视图（可以缓存数据）

五、执行Hive的查询：执行SQL（HQL）

注意：HQL是SQL的子集

六、使用JDBC查询Hive

七、Hive自定义函数：本质是Java程序


【大中小】【打印】【繁体】【投稿】【收藏】【推荐】【举报】【评论】【关闭】【返回顶部】

上一篇：hive:Access denied for user ..	下一篇：hive:Access denied for user ..