Translation of the Official HBase Documentation: HBase and MapReduce


2019-01-03 13:48:08

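This chapter discusses the specific configuration steps needed to run MapReduce over data stored in HBase. It also covers other interactions and issues between HBase and MapReduce jobs. Finally, it discusses Cascading, an alternative API for MapReduce.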

47. HBase, MapReduce, and the CLASSPATH

By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.

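To give MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add the HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Alternatively, you could edit $HADOOP_HOME/conf/hadoop-env.sh and add the HBase dependencies to the HADOOP_CLASSPATH variable. Neither approach is recommended, because both pollute your Hadoop installation with HBase references. They also require you to restart the Hadoop cluster before Hadoop can use the HBase data.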

The recommended approach is to let HBase add its dependency jars itself and to use HADOOP_CLASSPATH or -libjars.

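Since HBase 0.90.x, HBase adds its dependency jars to the job configuration itself. The dependencies only need to be available on the local CLASSPATH. From there they are picked up and bundled into the fat job jar that is deployed to the MapReduce cluster. A basic trick is to pass the full HBase classpath (all HBase and dependency jars, as well as the configuration) to the MapReduce job runner, and let the HBase utility pick what it needs out of the full classpath and add it to the MapReduce job configuration. See the source of TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.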

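The following example runs the bundled HBase RowCounter MapReduce job against a table named usertable. It sets HADOOP_CLASSPATH to the jars that HBase needs to run in a MapReduce context, including configuration files such as hbase-site.xml. Be sure to use the correct version of the HBase jar for your system: replace the VERSION string in the command below with the version of your local HBase installation. The backticks (`) cause the shell to execute the sub-command and assign the output of hbase classpath to HADOOP_CLASSPATH. This example assumes a BASH-compatible shell.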

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-mapreduce-VERSION.jar \
  org.apache.hadoop.hbase.mapreduce.RowCounter usertable

The command above launches a row-counting MapReduce job against the HBase cluster that your local HBase configuration points to, on the Hadoop cluster that your Hadoop configuration points to.

The main class of hbase-mapreduce.jar is a Driver that lists a few basic MapReduce tasks that ship with HBase. For example, presuming your install is HBase 2.0.0-SNAPSHOT:

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-mapreduce-2.0.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
  CellCounter: Count cells in HBase table.
  WALPlayer: Replay WAL files.
  completebulkload: Complete a bulk data load.
  copytable: Export a table from local cluster to peer cluster.
  export: Write table data to HDFS.
  exportsnapshot: Export the specific snapshot to a given FileSystem.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table.
  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed after being appended to the log.

You can use the short names listed above to run the MapReduce jobs, as in the following re-run of the row counter job (again presuming your install is HBase 2.0.0-SNAPSHOT):

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-mapreduce-2.0.0-SNAPSHOT.jar \
  rowcounter usertable

(The translation omits some less important content here.)

MapReduce Scan Caching

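TableMapReduceUtil once again lets you set scanner caching (the number of rows fetched per round trip) on the Scan object that is passed in. This functionality was lost due to a bug in HBase 0.95 (HBASE-11558), which was fixed in HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching value is as follows: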

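1. The caching setting that is set on the Scan object.
2. The caching setting specified via the configuration option hbase.client.scanner.caching, which can either be set manually in hbase-site.xml or via the helper method TableMapReduceUtil.setScannerCaching().
3. The default value HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING, which is set to 100.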

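Optimizing the caching setting is a balance between the time the client waits for a result and the number of result sets the client needs to receive. If the caching setting is too large, the client could end up waiting a long time, or the request could even time out. If the setting is too small, the scan has to return its results in many small pieces. If you think of the scan as a shovel, a bigger caching setting is analogous to a bigger shovel, and a smaller caching setting is equivalent to more shoveling to move the same amount.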
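To put rough numbers on this trade-off, here is a minimal sketch in plain Java. It is an illustrative model, not HBase code: it assumes each fetch returns exactly `caching` rows and ignores size limits, timeouts, and the final empty fetch that ends a scan. The class and method names are hypothetical.

```java
// Simplified model (assumption): one fetch returns up to `caching` rows.
final class ScanRoundTrips {

    /** Approximate number of fetches needed to stream rowCount rows. */
    static long roundTrips(long rowCount, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rowCount + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        for (int caching : new int[] {1, 100, 500}) {
            System.out.println("caching=" + caching + " -> "
                + roundTrips(rows, caching) + " fetches");
        }
    }
}
```

Under this model, going from a caching value of 1 to 500 cuts the fetch count for a million rows from 1,000,000 to 2,000, at the cost of each fetch taking longer and holding more rows in memory.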

The priority list above lets you set a reasonable default and override it for specific operations.
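The priority order can be sketched as a small lookup. This is a plain-Java model of the three rules above, not the actual HBase implementation: the `Map` stands in for the job configuration, and treating a non-positive scan value as "not set" is an assumption of the sketch. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the precedence rules; not HBase source code.
final class ScannerCachingPrecedence {
    static final String CONF_KEY = "hbase.client.scanner.caching";
    static final int DEFAULT_CACHING = 100; // default value named in the text

    /** scanCaching <= 0 means "not set on the Scan" in this model. */
    static int effectiveCaching(int scanCaching, Map<String, String> conf) {
        if (scanCaching > 0) {
            return scanCaching;                          // 1. set on the Scan
        }
        String configured = conf.get(CONF_KEY);
        if (configured != null) {
            return Integer.parseInt(configured.trim());  // 2. set in configuration
        }
        return DEFAULT_CACHING;                          // 3. built-in default
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(effectiveCaching(-1, conf));  // nothing set: default
        conf.put(CONF_KEY, "200");
        System.out.println(effectiveCaching(-1, conf));  // configuration wins
        System.out.println(effectiveCaching(500, conf)); // Scan setting wins
    }
}
```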

See the API documentation for Scan for more details.

Bundled HBase MapReduce Jobs

The HBase jar also serves as a driver for some bundled MapReduce jobs. To learn about the bundled MapReduce jobs, run the following command:

$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar
An example program must be given as the first argument.
Valid program names are:
  copytable: Export a table from local cluster to peer cluster
  completebulkload: Complete a bulk data load.
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table

Each of the valid program names is a bundled MapReduce job. To run one of the jobs, model your command after the following example.

$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar rowcounter myTable

HBase as a MapReduce Job Data Source and Data Sink

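HBase can be used as a data source for MapReduce jobs via TableInputFormat, and as a data sink via TableOutputFormat or MultiTableOutputFormat. When writing MapReduce jobs that read from or write to HBase, it is advisable to subclass TableMapper and/or TableReducer.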

For basic usage, see the do-nothing pass-through classes IdentityTableMapper and IdentityTableReducer. For a more involved example, see RowCounter or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.

If you run MapReduce jobs that use HBase as a source or sink, you need to specify the source and sink table and column names in your configuration.

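When you read from HBase, TableInputFormat requests the list of regions from HBase and creates the map tasks: either one map per region or mapreduce.job.maps maps, whichever is smaller. If your job only has two maps, raise mapreduce.job.maps to a number greater than the number of regions. If you run a TaskTracker/NodeManager and a RegionServer on every node, maps will run on the TaskTracker/NodeManager adjacent to the data.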
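The "whichever is smaller" rule described above amounts to a single `min`. The sketch below only restates that rule in plain Java so the effect of raising mapreduce.job.maps is easy to see. It is not taken from TableInputFormat, and the class and method names are hypothetical.

```java
// Restates the rule from the text; not code from TableInputFormat.
final class MapTaskCount {

    /** Map tasks launched: one per region, capped by mapreduce.job.maps. */
    static int mapTasks(int regionCount, int configuredMaps) {
        return Math.min(regionCount, configuredMaps);
    }

    public static void main(String[] args) {
        int regions = 40;
        // With the setting left at 2, only two maps run over 40 regions.
        System.out.println(mapTasks(regions, 2));
        // Raising it above the region count yields one map per region.
        System.out.println(mapTasks(regions, 64));
    }
}
```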

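When writing to HBase, it may make sense to avoid the Reduce step and write back into HBase from within your map. This approach works when your job does not need the sorting and collation that MapReduce performs on the map-emitted data. HBase sorts on insert, so there is no point in sorting twice (and shuffling data around your MapReduce cluster) unless you need to. If you do not need the Reduce step, your map might emit counts of records processed for reporting at the end of the job, or you can set the number of Reduces to zero and use TableOutputFormat. If running the Reduce step does make sense in your case, you should typically use multiple reducers so that the load is spread across the HBase cluster.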

Translator's note: in other words, writing to HBase does not require a reduce phase to sort the data, because HBase sorts on insert.

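A newer HBase partitioner, HRegionPartitioner, can run as many reducers as there are existing regions. HRegionPartitioner is suitable when your table is large and your upload will not greatly alter the number of existing regions upon completion. Otherwise, use the default partitioner.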
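The idea behind region-based partitioning can be sketched without any HBase dependency: route each row key to the reducer that corresponds to the region whose key range contains it. The code below is a simplified illustration under stated assumptions, not the HRegionPartitioner source. The region start keys are hard-coded rather than read from a live table, and the modulo fold used when there are fewer reducers than regions is a placeholder for whatever rule the real class applies. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Simplified illustration of region-aligned partitioning; not HBase source.
final class RegionPartitionSketch {

    /** Unsigned lexicographic byte comparison, the ordering used for row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /**
     * Index of the last region whose start key is <= rowKey.
     * startKeys must be sorted, and startKeys[0] is the empty start key.
     */
    static int regionIndex(byte[][] startKeys, byte[] rowKey) {
        int lo = 0, hi = startKeys.length - 1, found = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(startKeys[mid], rowKey) <= 0) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }

    /** Reducer for a row; the modulo fold is an assumption of this sketch. */
    static int partition(byte[][] startKeys, byte[] rowKey, int numReducers) {
        return regionIndex(startKeys, rowKey) % numReducers;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical table with four regions: ["", g), [g, n), [n, t), [t, "").
        byte[][] startKeys = {bytes(""), bytes("g"), bytes("n"), bytes("t")};
        for (String row : new String[] {"apple", "grape", "nut", "zebra"}) {
            System.out.println(row + " -> reducer "
                + partition(startKeys, bytes(row), startKeys.length));
        }
    }
}
```

With one reducer per region, each reducer writes to a single region's key range, which is why this approach fits a large table whose region count stays roughly stable during the upload.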

Writing HFiles Directly During Bulk Import

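If you are importing into a new table, you can bypass the HBase API and write your content directly to the filesystem, formatted as HBase data files (HFiles). Your import will run faster, perhaps an order of magnitude faster. For more on how this mechanism works, see Bulk Loading.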

RowCounter Example

The bundled RowCounter MapReduce job uses TableInputFormat and counts all rows in the specified table. To run it, use the following command:

$ ./bin/hadoop jar hbase-X.X.X.jar

This invokes the HBase MapReduce Driver class. Select rowcounter from the choice of jobs offered, then specify the table name, the column to count, and the output directory.

(Section 53 is omitted.)

54. HBase MapReduce Examples

HBase MapReduce Read Example

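The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is emitted from the Mapper. The job would be defined as follows: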

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleRead");   // the Job constructors are deprecated
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

The mapper instance extends TableMapper:

public static class MyMapper extends TableMapper<Text, Text> {

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // process data for the row from the Result instance.
  }
}

HBase MapReduce Read/Write Example

The following is an example of using HBase both as a source and as a sink with MapReduce. This example simply copies data from one table to another.

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleReadWrite");   // the Job constructors are deprecated
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,      // input table
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,      // output table
  null,             // reducer class
  job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

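What TableMapReduceUtil is doing needs some explanation, especially for the reducer. TableOutputFormat is used as the outputFormat class, several parameters are set on the configuration (for example TableOutputFormat.OUTPUT_TABLE), and the reducer output key is set to ImmutableBytesWritable and the reducer output value to Writable. The programmer could set all of these on the job and configuration directly, but TableMapReduceUtil tries to make things easier.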

The following is the example mapper, which creates a Put that matches the input Result and emits it. Note: this is what the CopyTable utility does.

public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row, value));
  }

  // Builds a Put for the same row key that carries every cell of the Result.
  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
    Put put = new Put(key.get());
    for (Cell cell : result.rawCells()) {   // rawCells() replaces the deprecated raw()
      put.add(cell);
    }
    return put;
  }
}

There is no actual reducer step, so TableOutputFormat takes care of sending the Put to the target table.

This is just an example. Developers could choose not to use TableOutputFormat and connect to the target table themselves.

(Sections 54.3 through 57 are omitted.)

