TOP

Hive Hql基本语法全攻略

2018-12-03 09:17:31 【大中小】浏览:40次

Tags：Hive Hql 基本语法全攻略

Hive官网（HQL）语法手册（英文版）：

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

一、Hive的数据存储

　　1、Hive中所有的数据都存储在HDFS 中，没有专门的数据存储格式（可支持Text，SequenceFile，ParquetFile，RCFILE等）

　　2、只需要在创建表的时候告诉Hive 数据中的列分隔符和行分隔符，Hive 就可以解析数据。

　　3、Hive 中包含以下数据模型：DB、Table，External Table，Partition，Bucket。

　　　（1）：db：在hdfs中表现为${hive.metastore.warehouse.dir}目录下一个文件夹

　　　　（2）：table：在hdfs中表现所属db目录下一个文件夹

　　　　（3）：external table：外部表, 与table类似，不过其数据存放位置可以在任意指定路径

　　　　　　　　普通表: 删除表后, hdfs上的文件都删了

　　　　　　　　External外部表删除后, hdfs上的文件没有删除, 只是把文件删除了

　　　　（4）：partition：在hdfs中表现为table目录下的子目录

　　　　（5）：bucket：桶, 在hdfs中表现为同一个表目录下根据hash散列之后的多个文件, 会根据不同的文件把数据放到不同的文件中

二、Hive创建数据库操作

#创建：
create (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES] (property_name=value,name=value...)

#显示描述信息：
describe DATABASE|SCHEMA [extended] database_name。

#删除：
DROP DATABASE|SHCEMA [IF EXISTS] database_Name [RESTRICT|CASCADE]

#使用：
user database_name;

1、Hive创建数据表

（1）创建表(DDL操作)

建表语法如下所示：

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)] ----指定表的名称和表的具体列信息。

[COMMENT table_comment] ---表的描述信息。

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] ---表的分区信息。

[CLUSTERED BY (col_name, col_name, ...)

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] ---表的桶信息。

[ROW FORMAT row_format] ---表的数据分割信息，格式化信息。

[STORED AS file_format] ---表数据的存储序列化信息。

[LOCATION hdfs_path] ---数据存储的文件夹地址信息。

创建数据表解释说明：

1、CREATE TABLE创建一个指定名字的表。如果相同名字的表已经存在，则抛出异常；用户可以用IF NOT EXISTS选项来忽略这个异常。hive中的表可以分为内部表（托管表）和外部表，区别在于，外部表的数据不是有hive进行管理的，也就是说当删除外部表的时候，外部表的数据不会从hdfs中删除。而内部表是由hive进行管理的，在删除表的时候，数据也会删除。一般情况下，我们在创建外部表的时候会将表数据的存储路径定义在hive的数据仓库路径之外。hive创建表主要有三种方式，第一种直接使用create table命令，第二种使用create table ... as select...(会产生数据)。第三种使用create table tablename like exist_tablename命令。

2、EXTERNAL关键字可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径（LOCATION），Hive 创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。

3、LIKE 允许用户复制现有的表结构，但是不复制数据。

4、ROW FORMAT

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

用户在建表的时候可以自定义 SerDe 或者使用自带的 SerDe。如果没有指定 ROW FORMAT 或者 ROW FORMAT DELIMITED，将会使用自带的 SerDe。在建表的时候，用户还需要为表指定列，用户在指定表的列的同时也会指定自定义的 SerDe，Hive通过 SerDe 确定表的具体的列的数据。

5、STORED AS

SEQUENCEFILE| TEXTFILE |RCFILE

如果文件数据是纯文本，可以使用STORED AS TEXTFILE。如果数据需要压缩，使用STORED AS SEQUENCEFILE。

6、CLUSTERED BY

对于每一个表（table）或者分区， Hive可以进一步组织成桶，也就是说桶是更为细粒度的数据范围划分。Hive也是针对某一列进行桶的组织。Hive采用对列值哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

把表（或者分区）组织成桶（Bucket）有两个理由：

（1）获得更高的查询处理效率。桶为表加上了额外的结构，Hive 在处理有些查询时能利用这个结构。具体而言，连接两个在（包含连接列的）相同列上划分了桶的表，可以使用 Map 端连接（Map-side join）高效的实现。比如JOIN操作。对于JOIN操作两个表有一个相同的列，如果对这两个表都进行了桶操作。那么将保存相同列值的桶进行JOIN操作就可以，可以大大较少JOIN的数据量。

（2）使取样（sampling）更高效。在处理大规模数据集时，在开发和修改查询的阶段，如果能在数据集的一小部分数据上试运行查询，会带来很多方便。

7、create table命令介绍2

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name] table_name LIKE existing_table_orview_name ---指定要创建的表和已经存在的表或者视图的名称。

[LOCATION hdfs_path] ---数据文件存储的hdfs文件地址信息。

8、CREATE [EXTERNAL] TABLE [IF NOT EXISTS]

[db_Name] table_name ---指定要创建的表名称

...指定partition&bucket等信息，指定数据分割符号。

[AS select_statement] ---导入的数据

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;

2、Hive创建数据表解释

# page_view是数据表的名称，注意hive的数据类型和java的数据类型类似，和mysql和oracle等数据库的字段类型不一致。
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')

#COMMENT描述，可有可无的。
 COMMENT 'This is the page view table'

# PARTITIONED BY指定表的分区，可以先不管。
 PARTITIONED BY(dt STRING, country STRING)

# ROW FORMAT DELIMITED代表一行是一条记录，是自己创建的全部字段和文件的字段对应，一行对应一条记录。
 ROW FORMAT DELIMITED

#FIELDS TERMINATED BY '\001'代表一行记录中的各个字段以什么隔开，方便创建的数据字段对应文件的一条记录的字段。
   FIELDS TERMINATED BY '\001'

# STORED AS SEQUENCEFILE;代表对应的文件类型。最常见的是SEQUENCEFILE（以键值对类型格式存储的）类型。TEXTFILE类型。
STORED AS SEQUENCEFILE;

3、Hive创建数据表之后，load数据

//create & load（创建好数据表以后导入数据的操作如）：
hive> create table tb_order(id int,name string,memory string,price double)
> row format delimited
> fields terminated by '\t';

//从本地导入数据到hive的表中（实质就是将文件上传到hdfs中hive管理目录下）
load data local inpath '/home/hadoop/ip.txt' into table 要导入的表名称；

//从hdfs上导入数据到hive表中（实质就是将文件从原始目录移动到hive管理的目录下）

load data inpath 'hdfs://ns1/aa/bb/data.log' into table要导入的表名称;

//使用select语句来批量插入数据

insert overwrite table tab_ip_seq select * from要导入的表名称;

hive> load data local inpath '/home/hadoop/hivetest/phoneorder.data' into table tb_order;

或者

[root@slaver3 hivetest]# hadoop fs -put phoneorder2.data /user/hive/warehouse/tb_order

4、Hive查询，统计

5、External 外部表，优点：

做数据分析的时候，有的数据是业务系统产生的，或者读或者写这个文件，如果的默认的路径，即在配置文件里面写好了，如果做分析的时候数据表导数据，如果将数据表移动了，，业务系统再读这个文件就不存在了，这个时候使用外部表，外部表不要求数据非到默认的路径下面去，数据可以摆放到任意的hdfs路径下面；

创建外部表的语法：

//external外部表
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
     ip STRING,
     country STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/external/user';

//external外部表
//使用关键字EXTERNAL 
CREATE EXTERNAL TABLE 数据表名称(id int, name string,
     ip STRING,
     country STRING)
 ROW FORMAT DELIMITED 
 FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE

#location指定所在的位置：切记，重点。
 LOCATION '/external/user';

6、创建分区表（分区的好处是可以帮助你统计的时候少统计一些数据，加速数据统计）：

hive> create table tb_part(sNo int,sName string,sAge int,sDept string)
    > partition //拿不准的单词，可以tab一下进行提示，并不会影响你创建表；谢谢
partition     partitioned   partitions    
    > partitioned by (part string)
    > row format delimited  
    > fields terminated by ','
    > stored as textfile;
OK
Time taken: 0.351 seconds

并且将本地的数据上传到hive上面：

 1 hive> load data local inpath '/home/hadoop/data_hadoop/tb_part' overwrite into table tb_part partition (part='20171210');
 2 Loading data to table test.tb_part partition (part=20171210)
 3 Partition test.tb_part{part=20171210} stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
 4 OK
 5 Time taken: 2.984 seconds
 6 hive> load data local inpath '/home/hadoop/data_hadoop/tb_part' overwrite into table tb_part partition (part='20171211');
 7 Loading data to table test.tb_part partition (part=20171211)
 8 Partition test.tb_part{part=20171211} stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
 9 OK
10 Time taken: 0.566 seconds
11 hive> show par
12 parse_url(         parse_url_tuple(   partition          partitioned        partitions         
13 hive> show partition
14 partition     partitioned   partitions    
15 hive> show partitions tb_part;
16 OK
17 part=20171210
18 part=20171211
19 Time taken: 0.119 seconds, Fetched: 2 row(s)
20 hive>

7、创建带桶的数据表，然后将本地创建好测试数据上传到hive上面：

#设置变量,设置分桶为true, 设置reduce数量是分桶的数量个数
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;

 1 hive> create table if not exists tb_stud(id int,name string,age int)
 2     > partitioned by(clus string)
 3     > clustered by(id) sorted by(age) into 2 buckets  #分桶，根据id进行分桶，分成2个桶。
 4     > row format delimited 
 5     > fields terminated by ',';
 6 OK
 7 Time taken: 0.194 seconds
 8 hive> load data local inpath '/home/hadoop/data_hadoop/tb_clustered' overwrite into table tb_stud partition (clus='20171211');
 9 Loading data to table test.tb_stud partition (clus=20171211)
10 Partition test.tb_stud{clus=20171211} stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
11 OK
12 Time taken: 0.594 seconds
13 hive>

8、Hive修改表增加/删除分区

语法结构
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)

ALTER TABLE table_name DROP partition_spec, partition_spec,...
具体实例如下所示：
alter table student_p add partition(part='a') partition(part='b');

修改分区和删除分区的操作：

hive> alter table tb_stud add partition(clus='20171215') location '/user/hive/warehouse/test.db' partition(clus='20171216');
OK
Time taken: 1.289 seconds
hive> alter table tb_stud add partition
partition     partitioned   partitions    
hive> alter table tb_stud add partition(clus='20171217');
OK
Time taken: 0.097 seconds
hive> dfs -ls /user/hive/warehouse/test.db
    > ;
Found 4 items
drwxr-xr-x   - root supergroup          0 2017-12-09 23:32 /user/hive/warehouse/test.db/tb_log
drwxr-xr-x   - root supergroup          0 2017-12-10 00:14 /user/hive/warehouse/test.db/tb_part
drwxr-xr-x   - root supergroup          0 2017-12-10 00:43 /user/hive/warehouse/test.db/tb_stud
drwxr-xr-x   - root supergroup          0 2017-12-09 21:28 /user/hive/warehouse/test.db/tb_user
hive> show partitions tb_stud;
OK
clus=20171211
clus=20171215
clus=20171216
clus=20171217
Time taken: 0.119 seconds, Fetched: 4 row(s)
hive> alter table tb_stud drop partition
partition     partitioned   partitions    
hive> alter table tb_stud drop partition(clus='20171217');
Dropped the partition clus=20171217
OK
Time taken: 1.433 seconds
hive> show partitions tb_stud;
OK
clus=20171211
clus=20171215
clus=20171216
Time taken: 0.092 seconds, Fetched: 3 row(s)
hive> alter table tb_stud drop partition(clus='20171215'),partition(clus='20171216');
Dropped the partition clus=20171215
Dropped the partition clus=20171216
OK
Time taken: 0.271 seconds
hive> show partitions tb_stud;
OK
clus=20171211
Time taken: 0.094 seconds, Fetched: 1 row(s)
hive>

9、Hive重命名表

语法结构
ALTER TABLE table_name RENAME TO new_table_name
具体实例，如下所示：

 1 hive> show tables;
 2 OK
 3 tb_log
 4 tb_part
 5 tb_stud
 6 tb_user
 7 Time taken: 0.026 seconds, Fetched: 4 row(s)
 8 hive> alter table tb_user rename to tb_user_copy;
 9 OK
10 Time taken: 0.19 seconds
11 hive> show tables;
12 OK
13 tb_log
14 tb_part
15 tb_stud
16 tb_user_copy
17 Time taken: 0.05 seconds, Fetched: 4 row(s)
18 hive>

10、Hive增加/更新列

语法结构
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
注：ADD是代表新增一字段，字段位置在所有列后面(partition列前)，REPLACE则是表示替换表中所有字段。
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
具体实例如下所示：

 1 hive> desc tb_user;
 2 OK
 3 id                      int                                         
 4 name                    string                                      
 5 Time taken: 0.148 seconds, Fetched: 2 row(s)
 6 hive> alter table tb_user add columns(age int);
 7 OK
 8 Time taken: 0.238 seconds
 9 hive> desc tb_user;
10 OK
11 id                      int                                         
12 name                    string                                      
13 age                     int                                         
14 Time taken: 0.088 seconds, Fetched: 3 row(s)
15 hive> alter table tb_user replace columns(id int,name string,birthday string);
16 OK
17 Time taken: 0.132 seconds
18 hive> desc tb_user;
19 OK
20 id                      int                                         
21 name                    string                                      
22 birthday                string                                      
23 Time taken: 0.083 seconds, Fetched: 3 row(s)
24 hive>

11、Hive Load操作只是单纯的复制/移动操作，DML操作

语法结构
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
说明：
1、Load 操作只是单纯的复制/移动操作，将数据文件移动到 Hive 表对应的位置。
2、filepath：
　　相对路径，例如：project/data1
　　绝对路径，例如：/user/hive/project/data1
　　包含模式的完整 URI，列如：
　　hdfs://namenode:9000/user/hive/project/data1
3、LOCAL关键字
　　如果指定了 LOCAL， load 命令会去查找本地文件系统中的 filepath。
　　如果没有指定 LOCAL 关键字，则根据inpath中的uri[如果指定了 LOCAL，那么：
　　load 命令会去查找本地文件系统中的 filepath。如果发现是相对路径，则路径会被解释为相对于当前用户的当前路径。
　　load 命令会将 filepath中的文件复制到目标文件系统中。目标文件系统由表的位置属性决定。被复制的数据文件移动到表的数据对应的位置。
　　如果没有指定 LOCAL 关键字，如果 filepath 指向的是一个完整的 URI，hive 会直接使用这个 URI。否则：如果没有指定　　schema 或者 authority，Hive 会使用在 hadoop 配置文件中定义的 schema 和 authority，fs.default.name 指定了 Namenode 的 URI。
　　如果路径不是绝对的，Hive 相对于/user/进行解释。
　　Hive 会将 filepath 中指定的文件内容移动到 table （或者 partition）所指定的路径中。]查找文件
4、OVERWRITE 关键字
如果使用了 OVERWRITE 关键字，则目标表（或者分区）中的内容会被删除，然后再将 filepath 指向的文件/目录中的内容添加到表/分区中。
如果目标表（分区）已经有一个文件，并且文件名和 filepath 中的文件名冲突，那么现有的文件会被新文件所替代。

12、Hive的insert操作

Insert
将查询结果插入Hive表
语法结构
　　INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement

Multiple inserts:
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...

Dynamic partition inserts:

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement

  1 <!--基本模式插入。-->
  2 hive> load data local inpath '/home/hadoop/data_hadoop/tb_stud' overwrite into table tb_stud partition (clus='20171211');
  3 Loading data to table test.tb_stud partition (clus=20171211)
  4 Partition test.tb_stud{clus=20171211} stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
  5 OK
  6 Time taken: 4.336 seconds
  7 hive> select * from tb_stud where clus='20171211';
  8 OK
  9 1    张三    NULL    20171211
 10 2    lisi    NULL    20171211
 11 3    wangwu    NULL    20171211
 12 4    zhaoliu    NULL    20171211
 13 5    libai    NULL    20171211
 14 Time taken: 0.258 seconds, Fetched: 5 row(s)
 15 hive> insert overwrite table tb_stud partition(clus='20171218')
 16     >  select id,name,age from tb_stud where clus='20171211';
 17 Query ID = root_20171210012734_721f76d9-f670-42ad-bf68-bfb94baf5cda
 18 Total jobs = 3
 19 Launching Job 1 out of 3
 20 Number of reduce tasks is set to 0 since there's no reduce operator
 21 Starting Job = job_1512874725514_0005, Tracking URL = http://master:8088/proxy/application_1512874725514_0005/
 22 Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job  -kill job_1512874725514_0005
 23 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
 24 2017-12-10 01:28:00,125 Stage-1 map = 0%,  reduce = 0%
 25 2017-12-10 01:28:34,514 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.44 sec
 26 MapReduce Total cumulative CPU time: 1 seconds 440 msec
 27 Ended Job = job_1512874725514_0005
 28 Stage-4 is selected by condition resolver.
 29 Stage-3 is filtered out by condition resolver.
 30 Stage-5 is filtered out by condition resolver.
 31 Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_stud/clus=20171218/.hive-staging_hive_2017-12-10_01-27-34_894_112189266881641464-1/-ext-10000
 32 Loading data to table test.tb_stud partition (clus=20171218)
 33 Partition test.tb_stud{clus=20171218} stats: [numFiles=1, numRows=5, totalSize=58, rawDataSize=53]
 34 MapReduce Jobs Launched: 
 35 Stage-Stage-1: Map: 1   Cumulative CPU: 1.44 sec   HDFS Read: 3816 HDFS Write: 140 SUCCESS
 36 Total MapReduce CPU Time Spent: 1 seconds 440 msec
 37 OK
 38 Time taken: 64.66 seconds
 39 hive> select * from tb_stud where clus='20171218';
 40 OK
 41 1    张三    NULL    20171218
 42 2    lisi    NULL    20171218
 43 3    wangwu    NULL    20171218
 44 4    zhaoliu    NULL    20171218
 45 5    libai    NULL    20171218
 46 Time taken: 0.085 seconds, Fetched: 5 row(s)
 47 hive> 
 48 
 49 <!--多插入模式。-->
 50 hive> show partitions tb_stud;
 51 OK
 52 clus=20171211
 53 clus=20171218
 54 Time taken: 0.153 seconds, Fetched: 2 row(s)
 55 hive> alter table tb_stud add partition(clus='20171212');
 56 OK
 57 Time taken: 0.143 seconds
 58 hive> alter table tb_stud add partition(clus='20171213');
 59 OK
 60 Time taken: 0.399 seconds
 61 hive> alter table tb_stud add partition(clus='20171214');
 62 OK
 63 Time taken: 0.139 seconds
 64 hive> from tb_stud
 65     > insert overwrite table tb_stud partition(clus='20171213')
 66     > select id,name,age  where clus='20171211'
 67     > insert overwrite table tb_stud partition(clus='20171214')
 68     > select id,name,age  where clus='20171211';
 69 Query ID = root_20171210013655_0c4a1d78-88e2-4de0-99ca-074c9eed81a4
 70 Total jobs = 5
 71 Launching Job 1 out of 5
 72 Number of reduce tasks is set to 0 since there's no reduce operator
 73 Starting Job = job_1512874725514_0007, Tracking URL = http://master:8088/proxy/application_1512874725514_0007/
 74 Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job  -kill job_1512874725514_0007
 75 Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
 76 2017-12-10 01:37:05,501 Stage-2 map = 0%,  reduce = 0%
 77 2017-12-10 01:38:06,089 Stage-2 map = 0%,  reduce = 0%
 78 2017-12-10 01:38:08,363 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
 79 MapReduce Total cumulative CPU time: 1 seconds 460 msec
 80 Ended Job = job_1512874725514_0007
 81 Stage-5 is selected by condition resolver.
 82 Stage-4 is filtered out by condition resolver.
 83 Stage-6 is filtered out by condition resolver.
 84 Stage-11 is selected by condition resolver.
 85 Stage-10 is filtered out by condition resolver.
 86 Stage-12 is filtered out by condition resolver.
 87 Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_stud/clus=20171213/.hive-staging_hive_2017-12-10_01-36-55_602_8039889333976698612-1/-ext-10000
 88 Moving data to: hdfs://master:9000/user/hive/warehouse/test.db/tb_stud/clus=20171214/.hive-staging_hive_2017-12-10_01-36-55_602_8039889333976698612-1/-ext-10002
 89 Loading data to table test.tb_stud partition (clus=20171213)
 90 Loading data to table test.tb_stud partition (clus=20171214)
 91 Partition test.tb_stud{clus=20171213} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
 92 Partition test.tb_stud{clus=20171214} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
 93 MapReduce Jobs Launched: 
 94 Stage-Stage-2: Map: 1   Cumulative CPU: 1.57 sec   HDFS Read: 4798 HDFS Write: 280 SUCCESS
 95 Total MapReduce CPU Time Spent: 1 seconds 570 msec
 96 OK
 97 Time taken: 81.536 seconds
 98 hive> select * from tb_stud where clus='20171213';
 99 OK
100 1    张三    NULL    20171213
101 2    lisi    NULL    20171213
102 3    wangwu    NULL    20171213
103 4    zhaoliu    NULL    20171213
104 5    libai    NULL    20171213
105 Time taken: 0.138 seconds, Fetched: 5 row(s)
106 hive> select * from tb_stud where clus='20171214';
107 OK
108 1    张三    NULL    20171214
109 2    lisi    NULL    20171214
110 3    wangwu    NULL    20171214
111 4    zhaoliu    NULL    20171214
112 5    libai    NULL    20171214
113 Time taken: 0.075 seconds, Fetched: 5 row(s)
114 hive> 
115 
116 <!--自动分区模式。-->

13、Hive导出表数据

语法结构
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

multiple inserts:
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1

[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

 1 1、导出文件到本地。
 2 说明：
 3 数据写入到文件系统时进行文本序列化，且每列用^A来区分，\n为换行符。用more命令查看时不容易看出分割符，可以使用: sed -e 's/\x01/|/g' filename[]来查看。
 4 
 5 
 6 hive> insert overwrite local directory '/home/hadoop/data_hadoop/get_tb_stud'
 7     > select * from tb_stud;
 8 Query ID = root_20171210014640_4c499323-760e-4494-946b-5ffad8fb3789
 9 Total jobs = 1
10 Launching Job 1 out of 1
11 Number of reduce tasks is set to 0 since there's no reduce operator
12 Starting Job = job_1512874725514_0008, Tracking URL = http://master:8088/proxy/application_1512874725514_0008/
13 Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job  -kill job_1512874725514_0008
14 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
15 2017-12-10 01:46:50,400 Stage-1 map = 0%,  reduce = 0%
16 2017-12-10 01:47:25,696 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 7.13 sec
17 MapReduce Total cumulative CPU time: 7 seconds 130 msec
18 Ended Job = job_1512874725514_0008
19 Copying data to local directory /home/hadoop/data_hadoop/get_tb_stud
20 Copying data to local directory /home/hadoop/data_hadoop/get_tb_stud
21 MapReduce Jobs Launched:
22 Stage-Stage-1: Map: 2   Cumulative CPU: 7.13 sec   HDFS Read: 10392 HDFS Write: 515 SUCCESS
23 Total MapReduce CPU Time Spent: 7 seconds 130 msec
24 OK
25 Time taken: 47.258 seconds
26 hive>
27 
28 <!--导出数据到HDFS。-->
29 hive> insert overwrite directory 'hdfs://192.168.199.130:9000/user/hive/warehouse/tb_stud_get'
30     > select * from tb_stud;
31 Query ID = root_20171210015229_b0a323b1-b1dc-4f31-b932-cb8126bac2ff
32 Total jobs = 3
33 Launching Job 1 out of 3
34 Number of reduce tasks is set to 0 since there's no reduce operator
35 Starting Job = job_1512874725514_0009, Tracking URL = http://master:8088/proxy/application_1512874725514_0009/
36 Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job  -kill job_1512874725514_0009
37 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
38 2017-12-10 01:53:52,773 Stage-1 map = 0%,  reduce = 0%
39 2017-12-10 01:54:07,829 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.65 sec
40 MapReduce Total cumulative CPU time: 2 seconds 650 msec
41 Ended Job = job_1512874725514_0009
42 Stage-3 is filtered out by condition resolver.
43 Stage-2 is selected by condition resolver.
44 Stage-4 is filtered out by condition resolver.
45 Launching Job 3 out of 3
46 Number of reduce tasks is set to 0 since there's no reduce operator
47 Starting Job = job_1512874725514_0010, Tracking URL = http://master:8088/proxy/application_1512874725514_0010/
48 Kill Command = /home/hadoop/soft/hadoop-2.6.4/bin/hadoop job  -kill job_1512874725514_0010
49 Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
50 2017-12-10 01:54:55,697 Stage-2 map = 0%,  reduce = 0%
51 2017-12-10 01:55:45,607 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 1.43 sec
52 MapReduce Total cumulative CPU time: 1 seconds 430 msec
53 Ended Job = job_1512874725514_0010
54 Moving data to: hdfs://192.168.199.130:9000/user/hive/warehouse/tb_stud_get
55 MapReduce Jobs Launched: 
56 Stage-Stage-1: Map: 2   Cumulative CPU: 2.65 sec   HDFS Read: 10412 HDFS Write: 515 SUCCESS
57 Stage-Stage-2: Map: 1   Cumulative CPU: 1.43 sec   HDFS Read: 2313 HDFS Write: 515 SUCCESS
58 Total MapReduce CPU Time Spent: 4 seconds 80 msec
59 OK
60 Time taken: 202.508 seconds
61 hive> dfs -ls /user/hive/warehouse/tb_stud_get;
62 Found 1 items
63 -rwxr-xr-x   2 root supergroup        515 2017-12-10 01:55 /user/hive/warehouse/tb_stud_get/000000_0
64 hive>

14、Hive基本的

语法结构
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]

注：1、order by 会对输入做全局排序，因此只有一个reducer，会导致当输入规模较大时，需要较长的计算时间。
2、sort by不是全局排序，其在数据进入reducer前完成排序。因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序。
3、distribute by根据distribute by指定的内容将数据分到同一个reducer。
4、Cluster by 除了具有Distribute by的功能外，还会对该字段进行排序。因此，常常认为cluster by = distribute by + sort by

5、分桶表的最大的意思：最大的作用是用来提高join操作的效率；

6、思考这个问题：

　　select a.id,a.name,b.addr from a join b on a.id = b.id;

　　如果a表和b表已经是分桶表，而且分桶的字段是id字段

　　做这个join操作时，还需要全表做笛卡尔积吗？答案：不需要，因为相同的id就在同一个桶里面。

15、Hive删除hive的内部表和外部表的区别：

删除内部表是将元数据（TABL表），以及hdfs上面的文件夹以及文件一起删除；

删除外部表只是删除元数据（TABL表），hdfs上面的文件夹以及文件不删除。

16、创建一个新表根据老表（用来做一些中间结果的存储，再做后一步的处理）

hive> create table tb_order_new  
    > as                         
    > select id,name,memory,price
    > from tb_order;

17、Hive insert from select 通过select语句批量插入数据到别的表（用于向临时表中追加中间结果数据）

//创建一个表
create table tab_ip_like like tab_ip;

//批量插入数据，批量插入已经存在表
insert overwrite table tab_ip_like
    select * from tab_ip;

hive> create table tb_order_append(id int,name string,memory string,price double)
    > row format delimited
    > fields terminated by '\t';

hive> insert overwrite table tb_order_append
    > select * from tb_order;


hive> select * from tb_order_append;

18、Hive PARTITION ，分区表（partition），分区统计，可以对数据操作加快速度

查询分区：hive> show partitions part;　　 #show partitions 数据表名称;
删除分区：hive> alter table part drop partition(date='20180512');
添加分区：hive> alter table part add partition(date='20180512');

hive> create table tb_order_part(id int,name string,memory string,salary double)
    > partitioned by (month string)                                             
    > row format delimited                                                      
    > fields terminated by '\t';                                                
OK
Time taken: 0.13 seconds

然后将数据导入这个新建的分区里面（所谓分区就是在文件夹下面创建一个文件夹，把数据放到这个文件夹下面），如下所示：
hive> load data local inpath '/home/hadoop/hivetest/phoneorder.data' into table tb_order_part partition(month='201401');
hive> load data local inpath '/home/hadoop/hivetest/phoneorder2.data' into table tb_order_part partition(month='201402');

可以根据分区查询一下数据：

19、Hivewrite to hdfs，将结果写入到hdfs的文件中：

语法结构
join_table:
  table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
Hive 支持等值连接（equality joins）、外连接（outer joins）和（left/right joins）。Hive 不支持非等值的连接，因为非等值连接非常难转化到 map/reduce 任务。
另外，Hive 支持多于 2 个表的连接。
写 join 查询时，需要注意几个关键点：
1. 只支持等值join
例如： 
  SELECT a.* FROM a JOIN b ON (a.id = b.id)
  SELECT a.* FROM a JOIN b
    ON (a.id = b.id AND a.department = b.department)
是正确的，然而:
  SELECT a.* FROM a JOIN b ON (a.id>b.id)
是错误的。

2. 可以 join 多于 2 个表。
例如
  SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
如果join中多个表的 join key 是同一个，则 join 会被转化为单个 map/reduce 任务，例如：
  SELECT a.val, b.val, c.val FROM a JOIN b
    ON (a.key = b.key1) JOIN c
    ON (c.key = b.key1)
被转化为单个 map/reduce 任务，因为 join 中只使用了 b.key1 作为 join key。
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
  JOIN c ON (c.key = b.key2)
而这一 join 被转化为 2 个 map/reduce 任务。因为 b.key1 用于第一次 join 条件，而 b.key2 用于第二次 join。
   
3．join 时，每次 map/reduce 任务的逻辑：
    reducer 会缓存 join 序列中除了最后一个表的所有表的记录，再通过最后一个表将结果序列化到文件系统。这一实现有助于在 reduce 端减少内存的使用量。实践中，应该把最大的那个表写在最后（否则会因为缓存浪费大量内存）。例如：
 SELECT a.val, b.val, c.val FROM a
    JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
所有表都使用同一个 join key（使用 1 次 map/reduce 任务计算）。Reduce 端会缓存 a 表和 b 表的记录，然后每次取得一个 c 表的记录就计算一次 join 结果，类似的还有：
  SELECT a.val, b.val, c.val FROM a
    JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
这里用了 2 次 map/reduce 任务。第一次缓存 a 表，用 b 表序列化；第二次缓存第一次 map/reduce 任务的结果，然后用 c 表序列化。

4．LEFT，RIGHT 和 FULL OUTER 关键字用于处理 join 中空记录的情况
例如：
  SELECT a.val, b.val FROM 
a LEFT OUTER  JOIN b ON (a.key=b.key)
对应所有 a 表中的记录都有一条记录输出。输出的结果应该是 a.val, b.val，当 a.key=b.key 时，而当 b.key 中找不到等值的 a.key 记录时也会输出:
a.val, NULL
所以 a 表中的所有记录都被保留了；
“a RIGHT OUTER JOIN b”会保留所有 b 表的记录。

Join 发生在 WHERE 子句之前。如果你想限制 join 的输出，应该在 WHERE 子句中写过滤条件——或是在 join 子句中写。这里面一个容易混淆的问题是表分区的情况：
  SELECT a.val, b.val FROM a
  LEFT OUTER JOIN b ON (a.key=b.key)
  WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'
会 join a 表到 b 表（OUTER JOIN），列出 a.val 和 b.val 的记录。WHERE 从句中可以使用其他列作为过滤条件。但是，如前所述，如果 b 表中找不到对应 a 表的记录，b 表的所有列都会列出 NULL，包括 ds 列。也就是说，join 会过滤 b 表中不能找到匹配 a 表 join key 的所有记录。这样的话，LEFT OUTER 就使得查询结果与 WHERE 子句无关了。解决的办法是在 OUTER JOIN 时使用以下语法：
  SELECT a.val, b.val FROM a LEFT OUTER JOIN b
  ON (a.key=b.key AND
      b.ds='2009-07-07' AND
      a.ds='2009-07-07')
这一查询的结果是预先在 join 阶段过滤过的，所以不会存在上述问题。这一逻辑也可以应用于 RIGHT 和 FULL 类型的 join 中。

Join 是不能交换位置的。无论是 LEFT 还是 RIGHT join，都是左连接的。
  SELECT a.val1, a.val2, b.val, c.val
  FROM a
  JOIN b ON (a.key = b.key)
  LEFT OUTER JOIN c ON (a.key = c.key)
先 join a 表到 b 表，丢弃掉所有 join key 中不匹配的记录，然后用这一中间结果和 c 表做 join。这一表述有一个不太明显的问题，就是当一个 key 在 a 表和 c 表都存在，但是 b 表中不存在的时候：整个记录在第一次 join，即 a JOIN b 的时候都被丢掉了（包括a.val1，a.val2和a.key），然后我们再和 c 表 join 的时候，如果 c.key 与 a.key 或 b.key 相等，就会得到这样的结果：NULL, NULL, NULL, c.val

https://www.cnblogs.com/biehongli/p/7699578.html


【大中小】【打印】【繁体】【投稿】【收藏】【推荐】【举报】【评论】【关闭】【返回顶部】

上一篇：[Hive]Hive中表连接的优化，加快..	下一篇：Hive实战函数Find_in_set