
A Study of the reversedScan Problem in HBase 0.98
2019-04-28 01:50:57
Tags: 0.98, HBase, reversedScan

1. Finding the Problem

Recently, our company launched a new HBase-based application that performs a large volume of scans. After running for a while, it began to intermittently exhibit extremely slow scans. We run many other HBase-based applications, yet each time only the new one showed this symptom.


The only difference from the older applications is that the new one uses reversedScan. Could reversed scans be the cause?


We added jstack monitoring to the application and captured a thread dump while a slow scan was in progress, which gave us a first clue: a large number of threads were in the BLOCKED state.


java.lang.Thread.State: BLOCKED (on object monitor)

        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1319)

        - waiting to lock <0x00000000876a58d8> (a java.lang.Object)

        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1177)

        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)

        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)

        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)

        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)

        at org.apache.hadoop.hbase.client.ReversedClientScanner.nextScanner(ReversedClientScanner.java:124)

        at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:140)

        at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:135)

        at org.apache.hadoop.hbase.client.ReversedClientScanner.<init>(ReversedClientScanner.java:62)

        at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:791)

Cross-checking the HBase monitoring logs showed that, at the same moments, CPU usage on the regionserver hosting hbase:meta spiked sharply.


So while hbase-client is performing lookups, the meta regionserver's CPU spikes, which slows down region location. But why does that regionserver's CPU spike in the first place? And why do all the other applications stay healthy?


2. Investigating the Problem

hbase-client caches RegionLocation on the client side, so in theory the number of lookups that actually have to go to a regionserver should be small.
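The idea behind that cache can be sketched with a simplified, self-contained model (the class and method names below are hypothetical, not HBase API — the real ConnectionManager keeps per-table maps of HRegionLocation entries): a lookup only costs an RPC to hbase:meta when the cache misses or a reload is forced.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of hbase-client's region location cache: a lookup
// goes to hbase:meta only on a cache miss or when useCache is false.
public class RegionCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    public final AtomicInteger metaLookups = new AtomicInteger(); // counts meta RPCs

    // useCache=false forces a round trip to the meta regionserver
    public String locateRegion(String row, boolean useCache) {
        if (useCache) {
            String cached = cache.get(row);
            if (cached != null) {
                return cached;               // served locally, no meta RPC
            }
        }
        String location = lookupInMeta(row); // expensive: RPC to meta's regionserver
        cache.put(row, location);
        return location;
    }

    private String lookupInMeta(String row) {
        metaLookups.incrementAndGet();
        return "regionserver-for-" + row;    // stand-in for a real meta scan
    }

    public static void main(String[] args) {
        RegionCache c = new RegionCache();
        c.locateRegion("row1", true);   // miss -> meta RPC
        c.locateRegion("row1", true);   // hit  -> no RPC
        c.locateRegion("row1", false);  // forced reload -> meta RPC
        System.out.println("meta lookups: " + c.metaLookups.get()); // prints 2
    }
}
```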


Let's walk through the source code and trace the exact logic of a reversed scan.


ReversedClientScanner.java

class ReversedClientScanner extends ClientScanner {

  ......

  @Override
  protected boolean nextScanner(int nbRows, final boolean done)
      throws IOException {
    
    ......
    
    try {
      byte[] locateStartRow = locateTheClosestFrontRow ? createClosestRowBefore(localStartKey)
          : null;
      callable = getScannerCallable(localStartKey, nbRows, locateStartRow);
      // Open a scanner on the region server starting at the
      // beginning of the region
      this.caller.callWithRetries(callable);
      this.currentRegion = callable.getHRegionInfo();
      if (this.scanMetrics != null) {
        this.scanMetrics.countOfRegions.incrementAndGet();
      }
    } catch (IOException e) {
      ExceptionUtil.rethrowIfInterrupt(e);
      close();
      throw e;
    }
  
    ......
  }
  
  protected ScannerCallable getScannerCallable(byte[] localStartKey,
      int nbRows, byte[] locateStartRow) {
    scan.setStartRow(localStartKey);
    ScannerCallable s =
        new ReversedScannerCallable(getConnection(), getTable(), scan, this.scanMetrics,
            locateStartRow, rpcControllerFactory.newController());
    s.setCaching(nbRows);
    return s;
  }
}

During a reversedScan, ReversedClientScanner.nextScanner is invoked.

In nextScanner, ReversedClientScanner creates an RpcRetryingCaller to drive a ReversedScannerCallable.

RpcRetryingCaller.callWithRetries then calls the ReversedScannerCallable's prepare method followed by its call method.


RpcRetryingCaller.java

public class RpcRetryingCaller<T> {

  ......

  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
  throws IOException, RuntimeException {
    this.callTimeout = callTimeout;
    List<RetriesExhaustedException.ThrowableWithExtraContext> exceptions =
      new ArrayList<RetriesExhaustedException.ThrowableWithExtraContext>();
    this.globalStartTime = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0;; tries++) {
      long expectedSleep = 0;
      try {
        beforeCall();
        callable.prepare(tries != 0); // if called with false, check table status on ZK
        return callable.call();
      } catch (Throwable t) {
        if (tries > startLogErrorsCnt) {
          LOG.info("Call exception, tries=" + tries + ", retries=" + retries + ", retryTime=" +
              (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime) + "ms, msg="
              + callable.getExceptionMessageAdditionalDetail());
        }

    ......
  
  }
}
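The effect of `callable.prepare(tries != 0)` above can be demonstrated with a stripped-down, runnable sketch (the names are hypothetical, not HBase API): the reload flag is false on the very first attempt and true on every retry.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of RpcRetryingCaller.callWithRetries: prepare() receives
// reload=false on the first attempt and reload=true only on retries.
public class RetryingCallerSketch {
    public interface Callable<T> {
        void prepare(boolean reload) throws Exception;
        T call() throws Exception;
    }

    public static <T> T callWithRetries(Callable<T> callable, int maxTries) throws Exception {
        Exception last = null;
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                callable.prepare(tries != 0); // false first, true on every retry
                return callable.call();
            } catch (Exception e) {
                last = e;                     // the real code also sleeps with backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final List<Boolean> reloads = new ArrayList<>();
        String result = callWithRetries(new Callable<String>() {
            int attempts = 0;
            public void prepare(boolean reload) { reloads.add(reload); }
            public String call() throws Exception {
                if (++attempts < 3) throw new Exception("transient failure");
                return "ok";
            }
        }, 5);
        System.out.println(result + " " + reloads); // prints: ok [false, true, true]
    }
}
```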


The key to the problem lies in ReversedScannerCallable's prepare method.


ReversedScannerCallable.java

public class ReversedScannerCallable extends ScannerCallable {
  
  ......

  @Override
  public void prepare(boolean reload) throws IOException {
    if (!instantiated || reload) {
      if (locateStartRow == null) {
        // Just locate the region with the row
        this.location = connection.getRegionLocation(tableName, row, reload);
        if (this.location == null) {
          throw new IOException("Failed to find location, tableName="
              + tableName + ", row=" + Bytes.toStringBinary(row) + ", reload="
              + reload);
        }
      } else {
        // Need to locate the regions with the range, and the target location is
        // the last one which is the previous region of last region scanner
        List<HRegionLocation> locatedRegions = locateRegionsInRange(
            locateStartRow, row, reload);
        if (locatedRegions.isEmpty()) {
          throw new DoNotRetryIOException(
              "Does hbase:meta exist hole? Couldn't get regions for the range from "
                  + Bytes.toStringBinary(locateStartRow) + " to "
                  + Bytes.toStringBinary(row));
        }
        this.location = locatedRegions.get(locatedRegions.size() - 1);
      }
      setStub(getConnection().getClient(getLocation().getServerName()));
      checkIfRegionServerIsRemote();
      instantiated = true;
    }
    
    ......
  
  }
}


In ReversedScannerCallable's prepare method, the region location is resolved, and the useCache argument it passes down takes the value of reload.


Now go back to RpcRetryingCaller's callWithRetries method.

callWithRetries passes false on the first attempt and true only on subsequent retries.

As a result, during a reversedScan the useCache flag seen by locateRegion is false in the vast majority of cases! The cache never takes effect, so every scan has to hit the hbase:meta table to find the regionserver holding the scan's rowkey. That is exactly why CPU usage on the meta regionserver is so high.



ScannerCallable.java

public class ScannerCallable extends RegionServerCallable<Result[]> {

  ......

  @Override
  public void prepare(boolean reload) throws IOException {
    if (Thread.interrupted()) {
      throw new InterruptedIOException();
    }
    RegionLocations rl = RpcRetryingCallerWithReadReplicas.getRegionLocations(!reload,
        id, getConnection(), getTableName(), getRow());
    location = id < rl.size() ? rl.getRegionLocation(id) : null;
  
    ......

  }
}

Compare this with ScannerCallable: in its prepare method, useCache takes the value of !reload, which is why the other applications were barely affected even while the meta regionserver's CPU was pegged.
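Put side by side, the difference in cache behavior is stark. The sketch below (self-contained and hypothetical — it models only the polarity of the flag, not the real HBase classes) counts meta lookups for a run of first-attempt scans under the two conventions: `useCache = reload` (ReversedScannerCallable before the fix) versus `useCache = !reload` (ScannerCallable).

```java
import java.util.HashMap;
import java.util.Map;

// Models only the useCache polarity: each "scan" is a first attempt
// (reload=false). With useCache = reload the cache is never consulted;
// with useCache = !reload only the first scan touches meta.
public class UseCachePolarity {
    static int metaLookups;
    static Map<String, String> cache = new HashMap<>();

    public static String locate(String row, boolean useCache) {
        if (useCache && cache.containsKey(row)) {
            return cache.get(row);        // cache hit, no meta RPC
        }
        metaLookups++;                    // round trip to hbase:meta
        cache.put(row, "rs-for-" + row);
        return cache.get(row);
    }

    public static int runScans(int n, boolean buggy) {
        metaLookups = 0;
        cache.clear();
        boolean reload = false;           // first attempt, never a retry
        for (int i = 0; i < n; i++) {
            boolean useCache = buggy ? reload : !reload;
            locate("row-42", useCache);
        }
        return metaLookups;
    }

    public static void main(String[] args) {
        System.out.println("useCache = reload  : " + runScans(100, true));  // 100 meta RPCs
        System.out.println("useCache = !reload : " + runScans(100, false)); // 1 meta RPC
    }
}
```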



3. Fixing the Problem

A look through ReversedScannerCallable's change history on GitHub showed that HBASE-18665 fixes exactly this problem, but the fix was never applied to the 0.98 branch. So we had to do it ourselves.

To minimize the risk introduced by a full rebuild, we recompiled only ReversedScannerCallable and patched the resulting class file into hbase-client.jar.

With the patched hbase-client in place, everything ran normally in the test environment, and the monitoring on hbase:meta showed a clear drop in call volume.

After the change went to production, the slow scans never came back.


