HBase Client Types (Part 1)

2018-12-01 02:01:20

HBase ships with many clients for a variety of programming languages.

1. Introduction
----------------------------
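HBase can be accessed from most of today's popular languages and environments. You either use the client API directly, or go through
a proxy that translates your requests into API calls. These proxies wrap the native Java API into other protocol APIs, so that clients
can be written in any language the external API supports. Typically, the external API is implemented in a dedicated Java-based server
that internally uses the Table client API provided by HBase. This simplifies the implementation and maintenance of these gateway servers.

On the other hand, many tools try to hide HBase and its API as much as possible. You talk to a specific interface, or develop against
a set of libraries that form an access layer, for example a persistence layer that provides data access objects (DAOs).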

1.1 Gateways
-----------------------------------------------------------------------------------------------------------------------------------------
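Going back to the gateway approach, the protocol between a gateway and its clients is driven by the available choices and the
requirements of the remote client. The most common one is REST (Representational State Transfer), which builds on existing web
technologies. The actual transport is typically HTTP, the standard protocol for web applications. Because the protocol layer takes care
of transporting the data in an interoperable format, REST is an ideal choice for exchanging data between heterogeneous systems.

REST defines the semantics so that the protocol can be used in a generic way to address remote resources. Because it does not change
the protocol, REST is compatible with existing technologies such as web servers and proxies. Resources are uniquely specified as part
of the request URI, in contrast to, for example, SOAP-based services, which define a new protocol that conforms to a standard.

However, both REST and SOAP suffer from the verbosity of their protocols: client and server communicate in human-readable text, either
plain or XML-based. Transparently compressing the data sent over the network can mitigate this problem to a certain extent.

As a result, companies with very large server farms, heavy bandwidth usage, and many disjoint services felt the need to reduce this
overhead and implemented their own RPC layers. One of them was Google, which implemented Protocol Buffers. Because that implementation
was not published at first, Facebook developed its own version, named Thrift.

These projects have similar feature sets and differ mainly in the number of languages they support. A key difference between Protocol
Buffers and Thrift is that Protocol Buffers has no RPC stack of its own; instead, it generates RPC definitions that can be used with
other RPC libraries.

HBase ships with auxiliary servers for REST and Thrift. They are implemented as standalone gateway servers that can run on shared or
dedicated machines. Because Thrift has its own RPC implementation, the gateway server is merely a thin wrapper around it. For REST,
HBase has its own implementation that provides access to the stored data.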

NOTE:
-------------------------------------------------------------------------------------------------------------------------------------
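The RESTServer supplied by HBase actually supports Protocol Buffers as well. Instead of implementing a separate RPC server, it uses
the HTTP Accept header to send and receive data encoded in Protocol Buffers.

There is no single recommended way to deploy the gateway servers. You can colocate them with other servers, or run them on a
dedicated cluster.

Another approach is to run them directly on the client nodes. For example, if a web server builds its pages with PHP, it is
advantageous to run the gateway process on that same server. The communication between the client and the gateway then stays local,
while the RPC between the gateway and HBase uses the native protocol.

Choosing a gateway type is not trivial; it depends on your use case: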

REST Use Case
-------------------------------------------------------------------------------------------------------------------------------------
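REST fits into existing web-based infrastructure and works well with reverse proxies and other caching technologies. Running multiple
REST servers in parallel spreads the load across them. For example, run a REST server on every machine that runs an application
server, building a single-app-to-server relationship.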

Thrift/Avro Use Case
-------------------------------------------------------------------------------------------------------------------------------------
Use a compact binary protocol when you need the best possible throughput. You can run fewer servers, for example one Thrift/Avro
server per region server, forming a many-apps-to-server relationship.

NOTE:
-------------------------------------------------------------------------------------------------------------------------------------
HBase also included an Avro gateway server, but because it received little attention and support, it was dropped in versions after
HBase 0.96.

1.2 Frameworks
-----------------------------------------------------------------------------------------------------------------------------------------
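A long-standing trend in software development is to modularize and decouple specific units of work. You may call this separation of
responsibilities or something similar, but the goal is the same: build common pieces of software once instead of reinventing the wheel
over and over. Many programming languages have a notion of modules; in Java these are JAR files, which provide shared code to many
consumers. Persistence or generic data access libraries are one such class of modules. A popular choice is Hibernate, which provides a
common interface for persisting all kinds of objects.

There are also languages dedicated to data manipulation, and similar approaches that make data manipulation tasks as seamless as
possible so that they do not clutter the business logic. The domain-specific languages (DSLs) discussed below cover these aspects.
Another newer trend is to abstract away application development itself, with platform-as-a-service (PaaS) being its most prominent
form. A PaaS provides all the components needed to write applications quickly, including application servers, accompanying libraries,
databases, and so on.

With PaaS, you still write the code and deploy it onto the infrastructure the PaaS provides. The logical next step is to offer a data
access API that an application can use without any further setup. Google App Engine is one such service: applications talk directly to
a datastore API that is provided as a library. This limits the freedom of the application, but assuming the storage API is powerful
enough and does not restrict the developer's creativity, it makes developing and operating applications much easier.

Hadoop is a very powerful and flexible system. In fact, many of its components can be replaced, and Hadoop is better understood as a
school of thought than as a particular collection of technologies. It is therefore no surprise that a new class of active frameworks
is emerging. Similar to Google App Engine, these frameworks provide a server component that accepts requests from the applications
deployed onto it and abstracts the interfaces to underlying services, such as storage.

What is interesting about these frameworks, which we will call data application servers or data-as-a-service (DaaS), is that they
embrace the native services of Hadoop and are data first. Much like on a smartphone, you install applications that implement business
use cases, and they run on the hosts where the shared data resides. There is no need to move large amounts of data around at high cost
to produce a result. With HBase as the storage engine, these frameworks can make the best use of many built-in features, for example
server-side coprocessors for pushing down selection predicates and analytical functions. One example is Cask.

What libraries and frameworks have in common is the notion of an abstraction layer, be it a generic data access API or a DSL. The same
holds for frameworks on top of HBase and for other generic storage layers that implement SQL capabilities. They are discussed
separately in a later section (see “SQL over NoSQL”) and offer varying levels of SQL conformance, for example Impala, Hive, and Phoenix.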

*
*
*

2. Gateway Clients
-----------------------------------------------------------------------------------------------------------------------------------------
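The first group of clients consists of the gateway kind, which sends client API calls such as get, put, or delete to servers on
demand. Depending on the chosen protocol, you access HBase from your own programs through the supplied gateway servers. Alternatively,
you can use the provided storage API to implement generic, data-centric solutions.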

2.1 Native Java
-----------------------------------------------------------------------------------------------------------------------------------------
The native Java API was discussed in the preceding chapters. It requires no gateway server: you use Table or BufferedMutator to talk
to the HBase servers directly via native RPC calls.

2.2 REST
-----------------------------------------------------------------------------------------------------------------------------------------
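HBase ships with a powerful REST server that supports the complete client and administrative API. It also supports several message
formats, giving clients a choice of how to communicate with the server.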

■ Operation
-----------------------------------------------------------------------------------------------------------------------------------------
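Before REST-based clients can connect to HBase, you need to start the appropriate gateway server using the supplied scripts. The
following commands show how to get the command-line help and how to start the REST server in non-daemon mode.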

$ bin/hbase rest
usage: bin/hbase rest start [--infoport <arg>] [-p <arg>] [-ro]
--infoport <arg> Port for web UI
-p,--port <arg> Port to bind to [default: 8080]
-ro,--readonly Respond only to GET HTTP method requests [default:false]

To run the REST server as a daemon, execute bin/hbase-daemon.sh start|stop rest [--infoport <port>] [-p <port>] [-ro]
$ bin/hbase rest start
^C

Press Ctrl-C to quit the process. The help text states that a different script is needed to run the server as a background process:

$ bin/hbase-daemon.sh start rest
starting rest, logging to /var/lib/hbase/logs/hbase-larsgeorge-rest-<servername>.out

Once the server is started, use curl on the command line to verify that it is operational:

$ curl http://<servername>:8080/
testtable

$ curl http://<servername>:8080/version
rest 0.0.3 [JVM: Oracle Corporation 1.7.0_51-24.51-b03] [OS: Mac OS X 10.10.2 x86_64] [Server: jetty/6.1.26] [Jersey: 1.9]

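Retrieving the root URL, that is "/", returns the list of available tables, here testtable. "/version" returns the REST server version
along with details about the machine it runs on.

Alternatively, you can open the web-based UI provided by the REST server. Its port can be changed with the --infoport command-line
parameter when starting the REST server, or with the hbase.rest.info.port configuration property; the default is 8085.

The UI shares its functionality with many of the other web-based UIs HBase provides. The middle section shows information about the
server and its state. For the REST server there is not much beyond the HBase version, the compile information, and the server start
time. The bottom of the page links to the HBase Wiki page that explains the REST API. The top of the page has several links to
additional functionality.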

● Home
-------------------------------------------------------------------------------------------------------------------------------------
Links to the Home page of the server.

● Local logs
-------------------------------------------------------------------------------------------------------------------------------------
Opens a page that lists the local log directory, providing web-based access to log files that are otherwise inaccessible.

● Log Level
-------------------------------------------------------------------------------------------------------------------------------------
This page lets you query and set the log level of any class or package loaded in the server process.

● Metrics Dump
-------------------------------------------------------------------------------------------------------------------------------------
All servers in HBase track their activity as metrics, which this link makes available in JSON format.

● HBase Configuration
-------------------------------------------------------------------------------------------------------------------------------------
Prints the configuration used by the current server process.

To stop a REST server running as a daemon, invoke the same script with the stop command instead of start.

$ bin/hbase-daemon.sh stop rest
stopping rest..

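You can start as many REST servers as you like and use a load balancer to route traffic between them. Because REST is stateless (any
state required is carried as part of the request), you can use a round-robin or similar approach to distribute the load.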
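
The round-robin idea can be sketched in a few lines of Java. This is only an illustration of client-side rotation, not something HBase provides: the class name RestGatewayRoundRobin and the host names are hypothetical, and in practice a dedicated load balancer in front of the REST servers usually does this job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out REST gateway base URLs in round-robin order. */
final class RestGatewayRoundRobin {
    private final List<String> baseUrls;
    private final AtomicInteger counter = new AtomicInteger();

    RestGatewayRoundRobin(List<String> baseUrls) {
        if (baseUrls.isEmpty()) {
            throw new IllegalArgumentException("need at least one REST gateway");
        }
        // Defensive copy so later changes to the caller's list have no effect.
        this.baseUrls = new ArrayList<>(baseUrls);
    }

    /** Returns the next gateway; safe to call from several threads. */
    String nextBaseUrl() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), baseUrls.size());
        return baseUrls.get(index);
    }
}
```

Because every request carries all the state it needs, consecutive calls to nextBaseUrl() may go to different servers without any session handling.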

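The --readonly or -ro parameter switches the server into read-only mode, meaning it only responds to HTTP GET operations. Finally, the
-p or --port parameter sets a different port for the server to listen on; the default is 8080. The REST server honors a few more
properties at startup; the following table lists them with their default values: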

Configuration options for the REST server
+---------------------------------------+---------------+--------------------------------------------------------------------------
| Property | Default | Description
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.dns.nameserver | default | Defines the DNS server used for the name lookup (NOTE:)
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.dns.interface | default | Defines the network interface that the name is associated with (NOTE:)
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.port | 8080 | Sets the HTTP port the server will bind to. Also settable per instance
| | | with the -p and --port command-line parameter
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.host | 0.0.0.0 | Defines the address the server is listening on. Defaults to the wildcard
| | | address
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.info.port | 8085 | Specifies the port the webbased UI will bind to. Also settable per instance
| | | using the --infoport parameter
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.info.bindAddress | 0.0.0.0 | Sets the IP address the webbased UI is bound to. Defaults to the wildcard
| | | address
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.readonly | false | Forces the server into normal or read-only mode. Also settable by the
| | | --readonly, or -ro options
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.threads.max | 100 | Provides the upper boundary of the thread pool used by the HTTP server for
| | | request handlers
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.threads.min | 2 | Same as above, but sets the lower boundary on number of handler threads.
+---------------------------------------+---------------+--------------------------------------------------------------------------
|hbase.rest.connection.cleanup-interval |10000 (10secs) | Defines how often the internal housekeeping task checks for expired
| | | connections to the HBase cluster
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.connection.max-idletime |600000 (10mins)| Amount of time after which an unused connection is considered expired
+---------------------------------------+---------------+--------------------------------------------------------------------------
| hbase.rest.support.proxyuser | false | Flags if the server should support proxy users or not. This is used to
| | | enable secure impersonation.
+---------------------------------------+---------------+--------------------------------------------------------------------------

NOTE:
-------------------------------------------------------------------------------------------------------------------------------------
These two properties are used in tandem to look up the server’s hostname using the given network interface and name server. The
default value means it uses whatever is configured at the OS level.

■ Supported Formats
-----------------------------------------------------------------------------------------------------------------------------------------
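Using the HTTP Content-Type and Accept headers, you can switch between the formats in which data is sent to the server or returned to
the caller. As an example, the following shell commands create a table in HBase and insert one row into it: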

hbase(main):001:0> create 'testtable', 'colfam1'
0 row(s) in 0.6690 seconds

=> Hbase::Table - testtable
hbase(main):002:0> put 'testtable', "\x01\x02\x03", 'colfam1:col1', 'value1'
0 row(s) in 0.0230 seconds

hbase(main):003:0> scan 'testtable'
ROW COLUMN+CELL
\x01\x02\x03 column=colfam1:col1, timestamp=1429367023394, value=value1
1 row(s) in 0.0210 seconds

The example inserts a row with the binary row key 0x01 0x02 0x03 (in hexadecimal). The row has one column in a single column family,
holding the value value1.

● Plain (text/plain)
-------------------------------------------------------------------------------------------------------------------------------------
Some operations can return their data as plain text, for example the /version operation seen earlier:

$ curl -H "Accept: text/plain" http://<servername>:8080/version
rest 0.0.3 [JVM: Oracle Corporation 1.7.0_45-24.45-b08] [OS:Mac OS X 10.10.2 x86_64] [Server: jetty/6.1.26] [Jersey: 1.9]

For more complex return values, on the other hand, plain text does not work:

$ curl -H "Accept: text/plain" \
http://<servername>:8080/testtable/%01%02%03/colfam1:col1
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 406 Not Acceptable</title>
</head>
<body><h2>HTTP ERROR 406</h2>
<p>Problem accessing /testtable/%01%02%03/colfam1:col1. Reason:
<pre> Not Acceptable</pre></p>
<hr /><i><small>Powered by Jetty://</small></i><br/>
<br/>
...
<br/>
</body>
</html>

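The reason is that the server cannot make any assumptions about how to format a complex result as plain text. You need a format that
can express nested information natively.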

NOTE:
-------------------------------------------------------------------------------------------------------------------------------------
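The row key used in the example is binary and consists of three bytes. To address these bytes through REST, URL-encode the row key,
which yields %01%02%03. The full URL to retrieve a cell is therefore: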

http://<servername>:8080/testtable/%01%02%03/colfam1:col1
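
A minimal sketch of the URL encoding step, assuming you build request paths by hand: the class RowKeyUrlEncoder is hypothetical and not part of HBase. It leaves the unreserved URI characters untouched and percent-encodes every other byte.

```java
/** Hypothetical helper: percent-encodes a binary row key for use in a REST URL path. */
final class RowKeyUrlEncoder {
    private RowKeyUrlEncoder() {
    }

    static String encode(byte[] rowKey) {
        StringBuilder out = new StringBuilder(rowKey.length * 3);
        for (byte b : rowKey) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9') || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                // Everything else, including binary bytes, becomes %XX.
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }
}
```

Calling RowKeyUrlEncoder.encode(new byte[] {1, 2, 3}) yields %01%02%03, the path segment used in the URL above, while a plain key such as row-30 passes through unchanged.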

● XML (text/xml)
-------------------------------------------------------------------------------------------------------------------------------------
XML is the default format for storing and retrieving data. For example, retrieving the example row without a specific Accept header
returns:

$ curl http://<servername>:8080/testtable/%01%02%03/colfam1:col1

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CellSet>
<Row key="AQID">
<Cell column="Y29sZmFtMTpjb2wx" timestamp="1429367023394">dmFsdWUx</Cell>
</Row>
</CellSet>

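The response format defaults to XML. Both the column name and the actual value are Base64 encoded, as the corresponding parts of the
schema specify.

Wherever base64Binary appears in the schema, the REST server returns encoded data. This makes it safe to transport binary data, which
keys and values may contain. The same applies to data sent to the REST server.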

A quick test with the base64 command on the console confirms the content:

$ echo AQID | base64 -D | hexdump
0000000 01 02 03

$ echo Y29sZmFtMTpjb2wx | base64 -D
colfam1:col1

$ echo dmFsdWUx | base64 -D
value1

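Obviously, this is only useful for verification on the command line. In your own code, use any available Base64 implementation to
decode the returned values.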
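
In Java, for instance, the standard java.util.Base64 class is one such implementation. The following sketch decodes the three values from the response above; the wrapper class RestBase64 is a hypothetical name, and decoding the column and value as UTF-8 text is an assumption that holds for this example but not for arbitrary binary cells.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: decodes Base64 fields returned by the REST server. */
final class RestBase64 {
    private RestBase64() {
    }

    /** Returns the raw bytes, suitable for binary row keys and values. */
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    /** Convenience for fields known to contain UTF-8 text. */
    static String decodeToString(String encoded) {
        return new String(decode(encoded), StandardCharsets.UTF_8);
    }
}
```

RestBase64.decode("AQID") gives the bytes 0x01 0x02 0x03, and decodeToString returns colfam1:col1 and value1 for the column and cell value, matching the console output.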

● JSON (application/json)
-------------------------------------------------------------------------------------------------------------------------------------
As with XML, requesting the data in JSON only requires setting the Accept header:

$ curl -H "Accept: application/json" http://<servername>:8080/testtable/%01%02%03/colfam1:col1
{
"Row": [{
"key": "AQID",
"Cell": [{
"column": "Y29sZmFtMTpjb2wx",
"timestamp": 1429367023394,
"$": "dmFsdWUx"
}]
}]
}

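The values are encoded the same way as in XML: Base64 is used for anything that may contain binary data. A notable difference is that
JSON has no nameless data fields. In XML the cell data is returned between the Cell tags, whereas JSON requires key/value pairs, so
there is no natural field name for it. For this reason JSON uses the special name "$", whose value is the content of the cell. In the
example above this is: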

"$":"dmFsdWUx"

Query the "$" field to get the Base64-encoded data.
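
As a rough illustration, the "$" field can be pulled out and decoded with nothing but the Java standard library. The class JsonCellValue is hypothetical, and the regular expression is a deliberate shortcut that only handles the simple response shown above; real code would more likely use a proper JSON parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the first "$" field from a REST JSON response. */
final class JsonCellValue {
    // Matches "$" : "<base64>" and captures the Base64 text; not a general JSON parser.
    private static final Pattern VALUE = Pattern.compile("\"\\$\"\\s*:\\s*\"([^\"]*)\"");

    private JsonCellValue() {
    }

    /** Returns the decoded cell value, assuming it holds UTF-8 text. */
    static String firstValue(String json) {
        Matcher m = VALUE.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no \"$\" field found");
        }
        byte[] raw = Base64.getDecoder().decode(m.group(1));
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

Applied to the JSON response shown earlier, firstValue returns value1.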

● Protocol Buffer (application/x-protobuf)
-------------------------------------------------------------------------------------------------------------------------------------
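An interesting aspect of the REST server is its ability to switch encodings. Because Protocol Buffers has no native RPC stack, the
HBase REST server offers support for its encoding instead.

To receive the result encoded in Protocol Buffers, send the matching Accept header: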

$ curl -H "Accept: application/x-protobuf" http://<servername>:8080/testtable/%01%02%03/colfam1:col1 | hexdump -C

...
00000000  0a 24 0a 03 01 02 03 12  1d 12 0c 63 6f 6c 66 61  |.$.........colfa|
00000010  6d 31 3a 63 6f 6c 31 18  a2 ce a7 e7 cc 29 22 06  |m1:col1......)".|
00000020  76 61 6c 75 65 31                                 |value1|

The hexdump command prints the binary encoded message in hexadecimal form. You need a Protocol Buffer decoder to access the actual
data in a structured way.
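
To give a feel for what such a decoder does, the sketch below hand-decodes a single base-128 varint, the wire encoding Protocol Buffers uses for integer fields. In the dump, the byte 0x18 is a tag (field number 3, varint wire type) and the six bytes after it, a2 ce a7 e7 cc 29, carry the cell timestamp. The class VarintDecoder is hypothetical; real code would use the classes generated by the Protocol Buffers compiler.

```java
/** Hypothetical helper: decodes one Protocol Buffers base-128 varint. */
final class VarintDecoder {
    private VarintDecoder() {
    }

    static long decode(byte[] buf, int offset) {
        long result = 0;
        int shift = 0;
        for (int i = offset; i < buf.length; i++) {
            byte b = buf[i];
            // The low 7 bits carry payload, least significant group first.
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // a cleared high bit marks the last byte
            }
            shift += 7;
            if (shift >= 64) {
                throw new IllegalArgumentException("varint too long");
            }
        }
        throw new IllegalArgumentException("truncated varint");
    }
}
```

Decoding the bytes a2 ce a7 e7 cc 29 this way gives 1429367023394, the timestamp shown by the shell scan earlier.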

● Raw binary (application/octet-stream)
-------------------------------------------------------------------------------------------------------------------------------------
Finally, you can dump the data in its raw form, omitting any structural data. In the following console command only the data stored
in the cell is returned:

$ curl -H "Accept: application/octet-stream" http://<servername>:8080/testtable/%01%02%03/colfam1:col1 | hexdump -C
00000000 76 61 6c 75 65 31 |value1|

NOTE:
-------------------------------------------------------------------------------------------------------------------------------------
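Depending on the requested format, the REST server puts structural data into custom headers. For the raw get request above, for
example, the response headers look similar to this: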

HTTP/1.1 200 OK
Content-Length: 6
X-Timestamp: 1429367023394
Content-Type: application/octet-stream

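The timestamp of the cell has been moved into the X-Timestamp header. Because the row and column keys are part of the request URI,
they are omitted from the response to avoid unnecessary data transfer.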

■ REST Java Client
-----------------------------------------------------------------------------------------------------------------------------------------
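The REST server also comes with a comprehensive Java client API, located in the org.apache.hadoop.hbase.rest.client package. Its
central classes are RemoteHTable and RemoteAdmin. The following example shows the use of the RemoteHTable class.

Example: Using the REST client classes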

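import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.rest.client.Client;
import org.apache.hadoop.hbase.rest.client.Cluster;
import org.apache.hadoop.hbase.rest.client.RemoteHTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RestExample {
  public static void main(String[] args) throws IOException {
    // Set up a cluster list adding all known REST server hosts
    Cluster cluster = new Cluster();
    cluster.add("localhost", 8080);

    // Create the client handling the HTTP communication
    Client client = new Client(cluster);

    // Create a remote table instance, wrapping the REST access into a familiar interface
    RemoteHTable table = new RemoteHTable(client, "testtable");

    // Perform a get operation as if it were a direct HBase connection
    Get get = new Get(Bytes.toBytes("row-30"));
    get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-3"));
    Result result1 = table.get(get);
    System.out.println("Get result1: " + result1);

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("row-10"));
    scan.setStopRow(Bytes.toBytes("row-15"));
    scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5"));

    // Scan the table, again, the same approach as if using the native Java API
    ResultScanner scanner = table.getScanner(scan);
    for (Result result2 : scanner) {
      System.out.println("Scan row[" + Bytes.toString(result2.getRow()) + "]: " + result2);
    }

    // Release the scanner and the remote table once done
    scanner.close();
    table.close();
  }
}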

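Running the example requires that the REST server is up and listening on the specified port. If the server runs on a different machine
or port, adjust the arguments of the add() call on the Cluster instance first. The output is similar to: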

Adding rows to table...
Get result1: keyvalues={row-30/colfam1:col-3/1429376615162/Put/vlen=8/seqid=0}
Scan row[row-10]: keyvalues={row-10/colfam1:col-5/1429376614839/Put/vlen=8/seqid=0}
Scan row[row-100]: keyvalues={row-100/colfam1:col-5/1429376616162/Put/vlen=9/seqid=0}
Scan row[row-11]: keyvalues={row-11/colfam1:col-5/1429376614856/Put/vlen=8/seqid=0}
Scan row[row-12]: keyvalues={row-12/colfam1:col-5/1429376614873/Put/vlen=8/seqid=0}
Scan row[row-13]: keyvalues={row-13/colfam1:col-5/1429376614891/Put/vlen=8/seqid=0}
Scan row[row-14]: keyvalues={row-14/colfam1:col-5/1429376614907/Put/vlen=8/seqid=0}

RemoteHTable is a convenient way to talk to a number of REST servers while using the regular Java client classes, such as Get and Scan.

NOTE:
-------------------------------------------------------------------------------------------------------------------------------------
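The current REST Java client implementation internally uses the Protocol Buffer encoding to talk to the remote REST server. It is the
most compact protocol the server supports and therefore makes the best use of the available bandwidth.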

2.3 Thrift
-----------------------------------------------------------------------------------------------------------------------------------------
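Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python,
Ruby, and more. Once you have compiled a schema, you can exchange messages transparently between systems implemented in one or more of
those languages.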

■ Installation
-----------------------------------------------------------------------------------------------------------------------------------------
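Before you can use Thrift, you need to install it, preferably from a binary package for your operating system. If none is available,
compile it from source.

Download the source tarball from the website and unpack it into a common location: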

$ wget http://www.apache.org/dist/thrift/0.9.2/thrift-0.9.2.tar.gz
$ tar -xzvf thrift-0.9.2.tar.gz -C /opt
$ rm thrift-0.9.2.tar.gz

Install the dependencies, namely Automake, LibTool, Flex, Bison, and the Boost libraries:

$ sudo apt-get install build-essential automake libtool flex bison libboost

Build and install Thrift:

$ cd /opt/thrift-0.9.2
$ ./configure
$ make
$ sudo make install

After installation, verify that it succeeded:

$ thrift -version
Thrift version 0.9.2

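Once Thrift is installed, you need to compile a schema into the programming language of your choice. HBase ships with a schema file
for its client and administrative API. Use the Thrift binary to create the wrappers for your development environment.

Before you can access HBase through Thrift, you must start the ThriftServer.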

■ Thrift Operations
-----------------------------------------------------------------------------------------------------------------------------------------
The Thrift server is started with the supplied scripts. Add the -h option to the command line, or omit all options, to get the help
text.

$ bin/hbase thrift
usage: Thrift [-b <arg>] [-c] [-f] [-h] [-hsha | -nonblocking |
-threadedselector | -threadpool] [--infoport <arg>] [-k <arg>] [-m <arg>] [-p <arg>] [-q <arg>] [-w <arg>]
-b,--bind <arg> Address to bind the Thrift server to.[default:0.0.0.0]
-c,--compact Use the compact protocol
-f,--framed Use framed transport
-h,--help Print help information
-hsha Use the THsHaServer This implies the framed transport.
--infoport <arg> Port for web UI
-k,--keepAliveSec <arg> The amount of time in secods to keep a thread alive when idle in TBoundedThreadPoolServer
-m,--minWorkers <arg> The minimum number of worker threads for TBoundedThreadPoolServer
-nonblocking Use the TNonblockingServer This implies the framed transport.
-p,--port <arg> Port to bind to [default: 9090]
-q,--queue <arg> The maximum number of queued requests in TBoundedThreadPoolServer
-threadedselector Use the TThreadedSelectorServer This implies the framed transport.
-threadpool Use the TBoundedThreadPoolServerThis is the default.
-w,--workers <arg> The maximum number of worker threads for TBoundedThreadPoolServer
To start the Thrift server run 'bin/hbase-daemon.sh start thrift'
To shutdown the thrift server run 'bin/hbase-daemon.sh stop thrift' or send a kill signal to the thrift server pid

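There are many options to choose from. The type of server, the protocol, and the transport are dictated by the client, because not
every language implementation supports all of them.

With the default options, start the Thrift server in non-daemon mode: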

$ bin/hbase thrift start
^C

Press Ctrl-C to quit the process. To run it as a background process, use the other script:

$ bin/hbase-daemon.sh start thrift
starting thrift, logging to /var/lib/hbase/logs/hbase-larsgeorge-thrift-<servername>.out

To stop a Thrift server running in daemon mode:

$ bin/hbase-daemon.sh stop thrift
stopping thrift..

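Once the Thrift server is started, you can open its web-based UI. Its port is set with the --infoport command-line parameter or the
hbase.thrift.info.port configuration property; the default is 9095.

The Thrift web-based UI offers functionality similar to many of the other web-based UIs HBase provides.

The Thrift server provides all the operations required to work with HBase tables. You can start multiple Thrift servers and use a load
balancer to route traffic between them. Use the -p or --port parameter to listen on a different port; the default is 9090.

The Thrift server honors a few more configuration properties at startup: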

Configuration options for the Thrift server
+-------------------------------------------------------+-------------------+-------------------------------------------------
| Property | Default | Description
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.port | 9090 | Sets the port the server will bind to. Also
| | | settable per instance with the -p or --port
| | | command-line parameter.(NOTE:)
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.ipaddress | 0.0.0.0 | Defines the address the server is listening on.
| | | Defaults to the wildcard address. Set with -b,
| | | --bind per instance on the command-line(NOTE:)
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.info.port | 9095 | Specifies the port the web-based UI will bind to.
| | |Also settable per instance using --infoport parameter
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.info.bindAddress | 0.0.0.0 | Sets the IP address the web-based UI is bound to.
| | | Defaults to the wildcard address
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.server.type | threadpool | Sets the Thrift server type in non-HTTP mode.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.compact | false | Enables the compact protocol mode if set to true.
| | | Default means binary mode instead. Also settable per
| | | instance with -c, or --compact.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.framed | false | Sets the transport mode to framed. Otherwise the
| | | standard transport is used. Framed cannot be used
| | | in secure mode. When using the hsha or nonblocking
| | | server type, framed transport is always used
| | | irrespective of this configuration property. Also
| | | settable per instance with -f, or --framed.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.framed.max_frame_size_in_mb | 2097152 (2MB) | The maximum frame size when framed transport mode is
| | | enabled
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.minWorkerThreads | 16 | Sets the minimum amount of worker threads to keep,
| | | should be increased for production use (for example,
| | | to 200). Settable on the command-line with -m, or
| | | --minWorkers
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.maxWorkerThreads | 1000 | Sets the upper limit of worker threads. Settable
| | | on the command-line with -w, or --workers.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.maxQueuedRequests | 1000 | Maximum number of request to queue when workers
| | | are all busy. Can be set with -q, and --queue per
| | | instance
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.threadKeepAliveTimeSec | 60 (secs) | Amount of time an extraneous idle worker is kept
| | | before it is discarded. Also settable with -k, or
| | | --keepAliveSec
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.http | false | Flag that determines if the server should run in
| | | HTTP or native mode.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.http_threads.max | 100 | Provides the upper boundary of the thread pool used
| | | by the HTTP server for request handlers
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.http_threads.min | 2 | Same as above, but sets the lower boundary on number
| | | of handler threads
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.ssl.enabled | false | When HTTP mode is enabled, this flag sets the SSL mode.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.ssl.keystore.store | "" | When SSL is enabled, sets the key store file
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.ssl.keystore.password | null | When SSL is enabled, sets the password to unlock
| | | the key store file
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.ssl.keystore.keypassword | null | When SSL is enabled, sets the password to retrieve
| | | the keys from the key store
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.security.qop | "" | Can be one of auth, auth-int, or auth-conf to set
| | | the SASL quality-of-protection (QoP)
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.support.proxyuser | false | Flags if the server should support proxy users or not.
| | | This is used to enable secure impersonation
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.kerberos.principal | <hostname> | Can be used to set the Kerberos principal to use
| | | in secure mode
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.keytab.file | "" | Specifies the Kerberos keytab file for secure operation
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.regionserver.thrift.coalesceIncrement | false | Enables the coalesce mode for increments, which is
| | | a delayed, batch increment operation
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.filters | "" | Loads filter classes into the server process for
| | | subsequent use
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.connection.cleanup-interval | 10000 (10 secs) | Defines how often the internal housekeeping task
| | | checks for expired connections to the HBase cluster.
+-------------------------------------------------------+-------------------+-------------------------------------------------
| hbase.thrift.connection.max-idletime | 600000 (10 mins) | Amount of time after which an unused connection is
| | | considered expired
+-------------------------------------------------------+-------------------+-------------------------------------------------

● nonblocking
-------------------------------------------------------------------------------------------------------------------------------------
Uses the TNonblockingServer class, which is based on the non-blocking I/O of Java NIO; a selector thread handles the actual requests.
Settable per server instance with the -nonblocking parameter.

● hsha
-------------------------------------------------------------------------------------------------------------------------------------
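Uses the THsHaServer class, which implements a Half-Sync/Half-Async (HsHa) server. The difference from the nonblocking server is that
it uses one thread to accept connections, but a thread pool for the processing. Settable per server instance with the -hsha parameter.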

● threadedselector
-------------------------------------------------------------------------------------------------------------------------------------
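An extension of the HsHa server that maintains two thread pools, one for network I/O (selection) and one for processing. Uses the
TThreadedSelectorServer class. Settable per server instance with the -threadedselector parameter.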

● threadpool
-------------------------------------------------------------------------------------------------------------------------------------
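One thread accepts connections, and the actual work is then scheduled on an ExecutorService. Each connection is dedicated to one
client, so highly concurrent usage requires many threads. Uses the TBoundedThreadPoolServer class, a custom implementation of Thrift's
TThreadPoolServer class. Settable per server instance with the -threadpool parameter.

The default threadpool type is a good choice for production use, because it combines several field-proven techniques.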

■ Example: PHP
-----------------------------------------------------------------------------------------------------------------------------------------
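HBase ships not only with the required Thrift schema files, but also with example clients for many programming languages. The PHP
implementation is used here to demonstrate the required steps.

Before you start, there are a few points to note:

● PHP support must be enabled for your web server.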
-------------------------------------------------------------------------------------------------------------------------------------

● HBase ships with precompiled PHP Thrift modules, so you can skip step 1 below (generating the Thrift modules anew); the result is
the same. The code is in the hbase-examples directory of HBase.
-------------------------------------------------------------------------------------------------------------------------------------

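● The included DemoClient.php is not up to date: for example, it tests with an empty row key, which is not allowed, and it uses
non-UTF-8 row keys. Both checks fail, so the PHP file needs careful fixing.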
-------------------------------------------------------------------------------------------------------------------------------------

Step 1:
-----------------------------------------------------------------------------------------------------------------------------------------
Copy the schema files supplied by HBase and compile the necessary PHP source files:

$ cp -r $HBASE_HOME/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift ~/thrift_src
$ cd ~/thrift_src/
$ thrift -gen php Hbase.thrift

This creates a gen-php directory inside thrift_src, containing the two generated PHP files needed to access HBase:

$ ls -l gen-php/Hbase/
total 920
-rw-r--r-- 1 larsgeorge staff 416357 Apr 20 07:46 Hbase.php
-rw-r--r-- 1 larsgeorge staff 52366 Apr 20 07:46 Types.php

If you skip this step, you can instead copy the precompiled PHP files that HBase provides in the hbase-examples directory of the source package:

$ ls -lR $HBASE_HOME/hbase-examples/src/main/php
total 24
-rw-r--r-- 1 larsgeorge admin 8438 Jan 25 10:47 DemoClient.php
drwxr-xr-x 3 larsgeorge admin 102 May 22 2014 gen-php
/usr/local/hbase-1.0.0-src/hbase-examples/src/main/php/gen-php:
total 0
drwxr-xr-x 4 larsgeorge admin 136 Jan 25 10:47 Hbase
/usr/local/hbase-1.0.0-src/hbase-examples/src/main/php/gen-php/Hbase:
total 800
-rw-r--r-- 1 larsgeorge admin 366528 Jan 25 10:47 Hbase.php

Step 2:
-----------------------------------------------------------------------------------------------------------------------------------------
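The generated files also require the PHP helper files supplied by Thrift. Copy these, together with the generated files, into the document root of the web server: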

$ cd /opt/thrift-0.9.2
$ sudo mkdir $DOCUMENT_ROOT/thrift/
$ sudo cp src/*.php $DOCUMENT_ROOT/thrift/
$ sudo cp -r lib/Thrift/* $DOCUMENT_ROOT/thrift/
$ sudo mkdir $DOCUMENT_ROOT/thrift/packages
$ sudo cp -r ~/thrift_src/gen-php/Hbase $DOCUMENT_ROOT/thrift/packages/

The DemoClient.php file that ships with HBase uses the generated files to communicate with the server, so copy it to the web server's document root as well:

$ sudo cp $HBASE_HOME/hbase-examples/src/main/php/DemoClient.php $DOCUMENT_ROOT/

Edit DemoClient.php and adjust the following fields near the top of the file:

# Change this to match your thrift root
$GLOBALS['THRIFT_ROOT'] = 'thrift';
...
# According to the thrift documentation, compiled PHP thrift libraries should
# reside under the THRIFT_ROOT/packages directory. If these compiled libraries
# are not present in this directory, move them there from gen-php/.
require_once( $GLOBALS['THRIFT_ROOT'].'/packages/Hbase/Hbase.php' );
...
$socket = new TSocket( 'localhost', 9090 );
...

After these changes, open a browser and point it at the demo page, for example:

http://<webserver-address>/DemoClient.php

The page loads and prints output such as the following:

scanning tables...
found: testtable
creating table: demo_table
column families in demo_table:
column: entry:, maxVer: 10
column: unused:, maxVer: 3
Starting scanner...
...

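The same demo client is also available for C++, Java, Perl, Python, and Ruby. The steps are the same: start the Thrift server, compile
the schema definition into the target language, and start the client. Depending on the language, the generated code has to be placed in the appropriate location.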

■ Example: Java
-----------------------------------------------------------------------------------------------------------------------------------------
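HBase already ships with the generated Java classes for communicating with the Thrift server, although they can also be regenerated from
the schema file. The following example uses these classes to talk to the Thrift server. Make sure the Thrift server is running and
listening on port 9090:

Example: Using the Thrift generated client API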

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Map;

import org.apache.hadoop.hbase.thrift.generated.ColumnDescriptor;
import org.apache.hadoop.hbase.thrift.generated.Hbase;
import org.apache.hadoop.hbase.thrift.generated.Mutation;
import org.apache.hadoop.hbase.thrift.generated.TCell;
import org.apache.hadoop.hbase.thrift.generated.TRowResult;
import org.apache.hadoop.hbase.thrift.generated.TScan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftExample {

    private static final byte[] TABLE = Bytes.toBytes("testtable");
    private static final byte[] ROW = Bytes.toBytes("testRow");
    private static final byte[] FAMILY1 = Bytes.toBytes("testFamily1");
    private static final byte[] FAMILY2 = Bytes.toBytes("testFamily2");
    private static final byte[] QUALIFIER = Bytes.toBytes("testQualifier");
    private static final byte[] COLUMN = Bytes.toBytes("testFamily1:testColumn");
    private static final byte[] COLUMN2 = Bytes.toBytes("testFamily2:testColumn2");
    private static final byte[] VALUE = Bytes.toBytes("testValue");

    public static void main(String[] args) throws Exception {

        // Assumes the Thrift server runs on the local host, port 9090; the timeout is 20 seconds.
        TTransport transport = new TSocket("localhost", 9090, 20000);
        TProtocol protocol = new TBinaryProtocol(transport, true, true);

        // Create a connection using the Thrift boilerplate classes.
        Hbase.Client client = new Hbase.Client(protocol);
        transport.open();

        ArrayList<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>();

        // Create two column descriptor instances, one per column family.
        ColumnDescriptor cd = new ColumnDescriptor();
        cd.name = ByteBuffer.wrap(FAMILY1);
        columns.add(cd);

        cd = new ColumnDescriptor();
        cd.name = ByteBuffer.wrap(FAMILY2);
        columns.add(cd);

        // Create the test table.
        client.createTable(ByteBuffer.wrap(TABLE), columns);

        ArrayList<Mutation> mutations = new ArrayList<Mutation>();
        mutations.add(new Mutation(false, ByteBuffer.wrap(COLUMN), ByteBuffer.wrap(VALUE), true));
        mutations.add(new Mutation(false, ByteBuffer.wrap(COLUMN2), ByteBuffer.wrap(VALUE), true));
        // Insert a test row.
        client.mutateRow(ByteBuffer.wrap(TABLE), ByteBuffer.wrap(ROW), mutations, null);

        TScan scan = new TScan();
        // Scan with an instance of TScan. This is the most convenient approach. Print the results in a loop.
        int scannerId = client.scannerOpenWithScan(ByteBuffer.wrap(TABLE), scan, null);
        for (TRowResult result : client.scannerGet(scannerId)) {
            System.out.println("No. columns: " + result.getColumnsSize());
            for (Map.Entry<ByteBuffer, TCell> column : result.getColumns().entrySet()) {
                System.out.println("Column name: " + Bytes.toString(column.getKey().array()));
                System.out.println("Column value: " + Bytes.toString(column.getValue().getValue()));
            }
        }
        client.scannerClose(scannerId);

        ArrayList<ByteBuffer> columnNames = new ArrayList<ByteBuffer>();
        columnNames.add(ByteBuffer.wrap(FAMILY1));

        // Scan again, but with another Thrift method. In addition, restrict the columns to a specific family.
        // Also print out the results in a loop.
        scannerId = client.scannerOpen(ByteBuffer.wrap(TABLE), ByteBuffer.wrap(Bytes.toBytes("")), columnNames, null);
        for (TRowResult result : client.scannerGet(scannerId)) {
            System.out.println("No. columns: " + result.getColumnsSize());
            for (Map.Entry<ByteBuffer, TCell> column : result.getColumns().entrySet()) {
                System.out.println("Column name: " + Bytes.toString(column.getKey().array()));
                System.out.println("Column value: " + Bytes.toString(column.getValue().getValue()));
            }
        }
        client.scannerClose(scannerId);
        System.out.println("Done.");

        // Close the connection after everything is done.
        transport.close();
    }
}

Output:
No. columns: 2
Column name: testFamily1:testColumn
Column value: testValue
Column name: testFamily2:testColumn2
Column value: testValue
No. columns: 1
Column name: testFamily1:testColumn
Column value: testValue
Done.
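
The Thrift API passes table names, row keys, and column names as ByteBuffer instances. Printing a buffer through its array() method, as
the example does, returns the whole backing array and is only correct when the buffer covers that array exactly. The sketch below uses
only the Java standard library and shows a more defensive conversion that copies just the readable region. The class ByteBufferText and
its methods are hypothetical helpers, not part of HBase or Thrift.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for turning Thrift-style ByteBuffer values into text. */
final class ByteBufferText {
    private ByteBufferText() {
    }

    /** Copies only the bytes between position and limit, leaving the caller's buffer untouched. */
    static byte[] toBytes(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate(); // shares content, but has its own position and limit
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    /** Decodes the readable region as UTF-8. */
    static String toText(ByteBuffer buffer) {
        return new String(toBytes(buffer), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] backing = "xxtestFamily1:testColumnyy".getBytes(StandardCharsets.UTF_8);
        // A buffer that exposes only the middle part of a larger backing array.
        ByteBuffer slice = ByteBuffer.wrap(backing, 2, backing.length - 4);
        System.out.println("readable region: " + toText(slice));
        System.out.println("backing array:   " + new String(slice.array(), StandardCharsets.UTF_8));
    }
}
```

Working on a duplicate keeps the original buffer's position unchanged, so the same value can be read more than once.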

2.4 Thrift2
-----------------------------------------------------------------------------------------------------------------------------------------
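Because the HBase client API changed considerably in version 0.90, the original Thrift API is out of sync with current HBase in many
places. Thrift2 is a new version of the Thrift gateway server. It uses the current HBase client API and therefore feels more natural to
HBase developers who are familiar with the native Java API.

Overall, the Thrift2 server is operated much like the original Thrift server, as the thrift2 command-line options show: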

$ bin/hbase thrift2

usage: Thrift [-b <arg>] [-c] [-f] [-h] [-hsha | -nonblocking |
-threadpool] [--infoport <arg>] [-p <arg>]
-b,--bind <arg> Address to bind the Thrift server to. [default:0.0.0.0]
-c,--compact Use the compact protocol
-f,--framed Use framed transport
-h,--help Print help information
-hsha Use the THsHaServer. This implies the framed transport.
--infoport <arg> Port for web UI
-nonblocking Use the TNonblockingServer. This implies the framed transport.
-p,--port <arg> Port to bind to [default: 9090]
-threadpool Use the TThreadPoolServer. This is the default.
To start the Thrift server run 'bin/hbase-daemon.sh start thrift2'
To shutdown the thrift server run 'bin/hbase-daemon.sh stop thrift2' or send a kill signal to the thrift server pid
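
As a rough illustration of what the framed transport option means on the wire: Thrift's framed transport prefixes every message with a
4-byte big-endian length, so the receiver knows how many bytes belong to one message. The sketch below uses only the Java standard
library; the class FramedMessageSketch and its methods are hypothetical, and a real client would use the framed transport class that
ships with Thrift rather than code like this.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: length-prefixed framing in the style of Thrift's framed transport. */
final class FramedMessageSketch {
    private FramedMessageSketch() {
    }

    /** Prepends a 4-byte big-endian length header to the payload. */
    static byte[] frame(byte[] payload) {
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(payload.length) // ByteBuffer writes big-endian by default
                .put(payload)
                .array();
    }

    /** Reads the length header and returns exactly that many payload bytes. */
    static byte[] unframe(byte[] framed) {
        ByteBuffer buffer = ByteBuffer.wrap(framed);
        if (buffer.remaining() < 4) {
            throw new IllegalArgumentException("missing length header");
        }
        int length = buffer.getInt();
        if (length < 0 || length > buffer.remaining()) {
            throw new IllegalArgumentException("bad frame length: " + length);
        }
        byte[] payload = new byte[length];
        buffer.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        byte[] framed = frame("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("framed size: " + framed.length);
        System.out.println("payload: " + new String(unframe(framed), StandardCharsets.UTF_8));
    }
}
```

Since -hsha and -nonblocking imply the framed transport according to the usage text, a client connecting to a server started with those
options has to frame its messages as well.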

Run with the default options in non-daemon mode:

$ bin/hbase thrift2 start
^C

Run as a daemon:

$ bin/hbase-daemon.sh start thrift2
starting thrift2, logging to /var/lib/hbase/logs/hbase-larsgeorge-thrift2-<servername>.out

Stop the daemon:

$ bin/hbase-daemon.sh stop thrift2
stopping thrift2.

2.5 SQL over NoSQL
-----------------------------------------------------------------------------------------------------------------------------------------
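SQL-over-NoSQL frameworks make HBase look much like a traditional RDBMS: they offer transactions, indexes, referential integrity, and
other well-known features, all built on top of a non-SQL system. These frameworks vary in their level of completeness, and each adds
services around HBase itself to bring back some of the database-related features. Some well-known projects are listed below: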

■ Phoenix
-------------------------------------------------------------------------------------------------------------------------------------
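The Apache Phoenix project provides the most native integration with HBase. The framework uses many advanced features to optimize SQL
queries against HBase tables, including coprocessors for secondary indexes, and filtering.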

■ Trafodion
-------------------------------------------------------------------------------------------------------------------------------------
Developed by HP as open source software, Trafodion is a system that combines existing database technology and uses HBase as its storage layer.

■ Impala
-------------------------------------------------------------------------------------------------------------------------------------
Another open source project under the Apache license. It is mainly used to run interactive queries on data stored in HDFS, and it can also access HBase tables directly.

■ Hive with Tez/Spark
-------------------------------------------------------------------------------------------------------------------------------------
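Hive was originally built on the Hadoop batch framework for data processing. With the option to replace MapReduce with another engine, such as Tez or Spark, it can run interactive HiveQL-based queries against HBase tables.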

Reference:

HBase: The Definitive Guide, 2nd Edition, Early Release, 2015.7, Lars George

