
A Study of the reversedScan Problem in HBase 0.98
2019-04-28 01:50:57
Tags: 0.98, HBase, reversedScan

1. Finding the Problem

Recently, our company launched a new HBase-based application that performs a large volume of scans. After running for a while, it began to intermittently exhibit extremely slow scans. We run many other HBase-based applications, yet each time only the new one showed this symptom.


The only difference from the older applications is that the new one uses reversedScan. Could reversed scans be the cause?


We added jstack monitoring to the application and captured a thread dump while a slow scan was in progress, which gave us a first clue: a large number of threads were in the BLOCKED state.


java.lang.Thread.State: BLOCKED (on object monitor)

        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1319)

        - waiting to lock <0x00000000876a58d8> (a java.lang.Object)

        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1177)

        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)

        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)

        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)

        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)

        at org.apache.hadoop.hbase.client.ReversedClientScanner.nextScanner(ReversedClientScanner.java:124)

        at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:140)

        at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:135)

        at org.apache.hadoop.hbase.client.ReversedClientScanner.<init>(ReversedClientScanner.java:62)

        at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:791)

Cross-checking the HBase monitoring logs showed that, at the same moments, CPU usage on the regionserver hosting hbase:meta spiked sharply.


So while hbase-client is performing lookups, the meta regionserver's CPU spikes, which slows down region location. But why does that regionserver's CPU spike in the first place? And why do all the other applications stay healthy?


2. Investigating the Problem

hbase-client caches RegionLocation on the client side, so in theory the number of lookups that actually have to go to a regionserver should be small.
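The idea behind that cache can be sketched with a simplified, self-contained model (the class and method names below are hypothetical, not HBase API — the real ConnectionManager keeps per-table maps of HRegionLocation entries): a lookup only costs an RPC to hbase:meta when the cache misses or a reload is forced.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of hbase-client's region location cache: a lookup
// goes to hbase:meta only on a cache miss or when useCache is false.
public class RegionCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    public final AtomicInteger metaLookups = new AtomicInteger(); // counts meta RPCs

    // useCache=false forces a round trip to the meta regionserver
    public String locateRegion(String row, boolean useCache) {
        if (useCache) {
            String cached = cache.get(row);
            if (cached != null) {
                return cached;               // served locally, no meta RPC
            }
        }
        String location = lookupInMeta(row); // expensive: RPC to meta's regionserver
        cache.put(row, location);
        return location;
    }

    private String lookupInMeta(String row) {
        metaLookups.incrementAndGet();
        return "regionserver-for-" + row;    // stand-in for a real meta scan
    }

    public static void main(String[] args) {
        RegionCache c = new RegionCache();
        c.locateRegion("row1", true);   // miss -> meta RPC
        c.locateRegion("row1", true);   // hit  -> no RPC
        c.locateRegion("row1", false);  // forced reload -> meta RPC
        System.out.println("meta lookups: " + c.metaLookups.get()); // prints 2
    }
}
```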


Let's walk through the source code and trace the exact logic of a reversed scan.


ReversedClientScanner.java

class ReversedClientScanner extends ClientScanner {

  ......

  @Override
  protected boolean nextScanner(int nbRows, final boolean done)
      throws IOException {
    
    ......
    
    try {
      byte[] locateStartRow = locateTheClosestFrontRow ? createClosestRowBefore(localStartKey)
          : null;
      callable = getScannerCallable(localStartKey, nbRows, locateStartRow);
      // Open a scanner on the region server starting at the
      // beginning of the region
      this.caller.callWithRetries(callable);
      this.currentRegion = callable.getHRegionInfo();
      if (this.scanMetrics != null) {
        this.scanMetrics.countOfRegions.incrementAndGet();
      }
    } catch (IOException e) {
      ExceptionUtil.rethrowIfInterrupt(e);
      close();
      throw e;
    }
  
    ......
  }
  
  protected ScannerCallable getScannerCallable(byte[] localStartKey,
      int nbRows, byte[] locateStartRow) {
    scan.setStartRow(localStartKey);
    ScannerCallable s =
        new ReversedScannerCallable(getConnection(), getTable(), scan, this.scanMetrics,
            locateStartRow, rpcControllerFactory.newController());
    s.setCaching(nbRows);
    return s;
  }
}

During a reversedScan, ReversedClientScanner.nextScanner is invoked.

In nextScanner, ReversedClientScanner creates an RpcRetryingCaller to drive a ReversedScannerCallable.

RpcRetryingCaller.callWithRetries then calls the ReversedScannerCallable's prepare method followed by its call method.


RpcRetryingCaller.java

public class RpcRetryingCaller<T> {

  ......

  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
  throws IOException, RuntimeException {
    this.callTimeout = callTimeout;
    List<RetriesExhaustedException.ThrowableWithExtraContext> exceptions =
      new ArrayList<RetriesExhaustedException.ThrowableWithExtraContext>();
    this.globalStartTime = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0;; tries++) {
      long expectedSleep = 0;
      try {
        beforeCall();
        callable.prepare(tries != 0); // if called with false, check table status on ZK
        return callable.call();
      } catch (Throwable t) {
        if (tries > startLogErrorsCnt) {
          LOG.info("Call exception, tries=" + tries + ", retries=" + retries + ", retryTime=" +
              (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime) + "ms, msg="
              + callable.getExceptionMessageAdditionalDetail());
        }

    ......
  
  }
}
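The effect of `callable.prepare(tries != 0)` above can be demonstrated with a stripped-down, runnable sketch (the names are hypothetical, not HBase API): the reload flag is false on the very first attempt and true on every retry.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of RpcRetryingCaller.callWithRetries: prepare() receives
// reload=false on the first attempt and reload=true only on retries.
public class RetryingCallerSketch {
    public interface Callable<T> {
        void prepare(boolean reload) throws Exception;
        T call() throws Exception;
    }

    public static <T> T callWithRetries(Callable<T> callable, int maxTries) throws Exception {
        Exception last = null;
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                callable.prepare(tries != 0); // false first, true on every retry
                return callable.call();
            } catch (Exception e) {
                last = e;                     // the real code also sleeps with backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final List<Boolean> reloads = new ArrayList<>();
        String result = callWithRetries(new Callable<String>() {
            int attempts = 0;
            public void prepare(boolean reload) { reloads.add(reload); }
            public String call() throws Exception {
                if (++attempts < 3) throw new Exception("transient failure");
                return "ok";
            }
        }, 5);
        System.out.println(result + " " + reloads); // prints: ok [false, true, true]
    }
}
```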


The key to the problem lies in ReversedScannerCallable's prepare method.


ReversedScannerCallable.java

public class ReversedScannerCallable extends ScannerCallable {
  
  ......

  @Override
  public void prepare(boolean reload) throws IOException {
    if (!instantiated || reload) {
      if (locateStartRow == null) {
        // Just locate the region with the row
        this.location = connection.getRegionLocation(tableName, row, reload);
        if (this.location == null) {
          throw new IOException("Failed to find location, tableName="
              + tableName + ", row=" + Bytes.toStringBinary(row) + ", reload="
              + reload);
        }
      } else {
        // Need to locate the regions with the range, and the target location is
        // the last one which is the previous region of last region scanner
        List<HRegionLocation> locatedRegions = locateRegionsInRange(
            locateStartRow, row, reload);
        if (locatedRegions.isEmpty()) {
          throw new DoNotRetryIOException(
              "Does hbase:meta exist hole? Couldn't get regions for the range from "
                  + Bytes.toStringBinary(locateStartRow) + " to "
                  + Bytes.toStringBinary(row));
        }
        this.location = locatedRegions.get(locatedRegions.size() - 1);
      }
      setStub(getConnection().getClient(getLocation().getServerName()));
      checkIfRegionServerIsRemote();
      instantiated = true;
    }
    
    ......
  
  }
}


In ReversedScannerCallable's prepare method, the region location is resolved, and the useCache argument it passes down takes the value of reload.


Now go back to RpcRetryingCaller's callWithRetries method.

callWithRetries passes false on the first attempt and true only on subsequent retries.

As a result, during a reversedScan the useCache flag seen by locateRegion is false in the vast majority of cases! The cache never takes effect, so every scan has to hit the hbase:meta table to find the regionserver holding the scan's rowkey. That is exactly why CPU usage on the meta regionserver is so high.



ScannerCallable.java

public class ScannerCallable extends RegionServerCallable<Result[]> {

  ......

  @Override
  public void prepare(boolean reload) throws IOException {
    if (Thread.interrupted()) {
      throw new InterruptedIOException();
    }
    RegionLocations rl = RpcRetryingCallerWithReadReplicas.getRegionLocations(!reload,
        id, getConnection(), getTableName(), getRow());
    location = id < rl.size() ? rl.getRegionLocation(id) : null;
  
    ......

  }
}

Compare this with ScannerCallable: in its prepare method, useCache takes the value of !reload, which is why the other applications were barely affected even while the meta regionserver's CPU was pegged.
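Put side by side, the difference in cache behavior is stark. The sketch below (self-contained and hypothetical — it models only the polarity of the flag, not the real HBase classes) counts meta lookups for a run of first-attempt scans under the two conventions: `useCache = reload` (ReversedScannerCallable before the fix) versus `useCache = !reload` (ScannerCallable).

```java
import java.util.HashMap;
import java.util.Map;

// Models only the useCache polarity: each "scan" is a first attempt
// (reload=false). With useCache = reload the cache is never consulted;
// with useCache = !reload only the first scan touches meta.
public class UseCachePolarity {
    static int metaLookups;
    static Map<String, String> cache = new HashMap<>();

    public static String locate(String row, boolean useCache) {
        if (useCache && cache.containsKey(row)) {
            return cache.get(row);        // cache hit, no meta RPC
        }
        metaLookups++;                    // round trip to hbase:meta
        cache.put(row, "rs-for-" + row);
        return cache.get(row);
    }

    public static int runScans(int n, boolean buggy) {
        metaLookups = 0;
        cache.clear();
        boolean reload = false;           // first attempt, never a retry
        for (int i = 0; i < n; i++) {
            boolean useCache = buggy ? reload : !reload;
            locate("row-42", useCache);
        }
        return metaLookups;
    }

    public static void main(String[] args) {
        System.out.println("useCache = reload  : " + runScans(100, true));  // 100 meta RPCs
        System.out.println("useCache = !reload : " + runScans(100, false)); // 1 meta RPC
    }
}
```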



3. Fixing the Problem

A look through ReversedScannerCallable's change history on GitHub showed that HBASE-18665 fixes exactly this problem, but the fix was never applied to the 0.98 branch. So we had to do it ourselves.

To minimize the risk introduced by a full rebuild, we recompiled only ReversedScannerCallable and patched the resulting class file into hbase-client.jar.

With the patched hbase-client in place, everything ran normally in the test environment, and the monitoring on hbase:meta showed a clear drop in call volume.

After the change went to production, the slow scans never came back.


