The Difference Between Hive Managed (Internal) and External Tables
2019-02-15 01:06:09 | Views: 118
Tags: Hive, managed tables, external tables
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ManagedandExternalTables
Managed and External Tables
By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation.

 If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.
-- If a managed table or partition is dropped, both its data and metadata are deleted. Without the PURGE option, the data is first moved to a trash folder and kept there for a defined retention period.
Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
-- Use managed tables when you want Hive to manage the table's lifecycle, or when creating temporary tables.
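As a sketch of the managed-table lifecycle (the table name, columns, and storage format below are illustrative, not from the original article):

```sql
-- Create a managed (internal) table; its files live under
-- hive.metastore.warehouse.dir, e.g. /apps/hive/warehouse/mydb.db/logs/.
CREATE TABLE logs (
  event_time STRING,
  message    STRING
)
STORED AS ORC;

-- Dropping it deletes both metadata and data.
-- Without PURGE the data first goes to the trash folder;
-- with PURGE it is deleted immediately and cannot be recovered.
DROP TABLE logs PURGE;
```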

An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. 
-- An external table holds only the metadata / schema describing files that live outside Hive; those files can be accessed and managed by processes outside of Hive.
External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.
-- External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table changes, refresh the metadata with: MSCK REPAIR TABLE table_name;
Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
-- Use external tables when the files already exist or live in a remote location, and the files should remain even if the table is dropped.
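A minimal sketch of an external table over pre-existing files (the HDFS path, schema, and delimiter are illustrative assumptions):

```sql
-- Create an external table over files that already exist on HDFS.
CREATE EXTERNAL TABLE ext_logs (
  event_time STRING,
  message    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/ext_logs';

-- Dropping it removes only the metadata; the files under
-- /data/ext_logs remain untouched.
DROP TABLE ext_logs;
```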
Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type.
-- Run DESCRIBE FORMATTED table_name to tell the two kinds apart; the output shows the table type as either MANAGED_TABLE or EXTERNAL_TABLE.
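For example (the table name is illustrative), the type appears in the "Table Type" row of the output:

```sql
DESCRIBE FORMATTED ext_logs;
-- Among the output rows, look for a line like:
-- Table Type:          EXTERNAL_TABLE
```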
Statistics can be managed on internal and external tables and partitions for query optimization. 

-- Statistics on managed tables, external tables, and partitions can all be managed and used for query optimization.


-- The command for repairing a table's partition metadata:
Recover Partitions (MSCK REPAIR TABLE)

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command), the metastore (and hence Hive) will not be aware of these partitions unless the user runs ALTER TABLE table_name ADD PARTITION commands on each of the newly added partitions.

-- If new partitions are added directly to HDFS (e.g. with hadoop fs -put), the metastore (and hence Hive) does not know about them unless the user runs ALTER TABLE table_name ADD PARTITION for each new partition. Alternatively, the user can run MSCK REPAIR TABLE table_name; to repair the metadata in one step.

However, users can run a metastore check command with the repair table option:

MSCK REPAIR TABLE table_name;
which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in the metastore to the metastore. See HIVE-874 for more details.

-- The command adds partition metadata to the Hive metastore for partitions that are missing from it; in other words, any partition that exists on HDFS but not in the metastore is added to the metastore.
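Putting the two approaches side by side (the table name, partition column, and HDFS path are illustrative assumptions):

```sql
-- Suppose files were copied straight into a new partition directory, e.g.:
--   hadoop fs -put day1.log /apps/hive/warehouse/mydb.db/logs/dt=2019-02-15/
-- Hive does not see the partition yet. Either register it explicitly:
ALTER TABLE logs ADD PARTITION (dt='2019-02-15');

-- ...or let Hive discover every untracked partition at once:
MSCK REPAIR TABLE logs;
```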

When there is a large number of untracked partitions, MSCK REPAIR TABLE can be run batch-wise to avoid OOME (Out of Memory Error). By setting the property hive.msck.repair.batch.size to a batch size, it runs in batches internally. The default value of the property is zero, which means all partitions are processed at once.

-- When there are many untracked partitions, run MSCK REPAIR TABLE in batches to avoid an OutOfMemoryError. Configure hive.msck.repair.batch.size to set the batch size; the default value of 0 means all partitions are processed at once.
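A sketch of the batched repair (the table name and batch size of 100 are illustrative assumptions):

```sql
-- Process untracked partitions in batches of 100 to limit memory use
-- (0, the default, handles all partitions in one pass).
SET hive.msck.repair.batch.size=100;
MSCK REPAIR TABLE logs;
```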









