HBase Error Records and Fixes
1. HBase runs into all kinds of problems during operation; most of them can be resolved by adjusting configuration files, and modifying the source code is of course also an option.
When the concurrency on HBase goes up, it frequently hits the "Too Many Open Files" error. The log reads as follows:
2012-06-01 16:05:22,776 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: 打开的文件过多
2012-06-01 16:05:22,776 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3790131629645188816_18192

2012-06-01 16:13:01,966 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-299035636445663861_7843 file=/hbase/SendReport/83908b7af3d5e3529e61b870a16f02dc/data/17703aa901934b39bd3b2e2d18c671b4.9a84770c805c78d2ff19ceff6fecb972
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1695)
at java.io.DataInputStream.readBoolean(DataInputStream.java:242)
at org.apache.hadoop.hbase.io.Reference.readFields(Reference.java:116)
at org.apache.hadoop.hbase.io.Reference.read(Reference.java:149)
at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:216)
at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:282)
at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2510)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:449)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3228)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3176)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:331)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:107)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

Cause and fix: on Linux, the default limit on the number of open files is usually 1024. Running ulimit -n 65535 raises it immediately, but the change does not survive a reboot. To make it persistent, use one of the following three methods (a quick verification sketch follows the list):

1. Add the line ulimit -SHn 65535 to /etc/rc.local
2. Add the line ulimit -SHn 65535 to /etc/profile
3. Append the following two lines to /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535
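
After applying one of these changes and restarting the HBase/HDFS daemons from a fresh login session, it is worth confirming that the running RegionServer really picked up the new limit. A minimal shell sketch, assuming jps is on the PATH and the process is named HRegionServer:

# PID of the RegionServer on this node (process name is an assumption)
pid=$(jps | awk '/HRegionServer/ {print $1}')

# effective limit of the running process
grep "open files" /proc/$pid/limits

# limit that a freshly started shell would inherit
ulimit -n

Note that values in /etc/security/limits.conf only apply to sessions started after the change, so daemons started earlier keep the old limit until they are restarted.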
2. HDFS writes involve two timeout settings: dfs.socket.timeout and dfs.datanode.socket.write.timeout. It is sometimes assumed that only the latter, dfs.datanode.socket.write.timeout, needs to be raised, but the error reported here is actually a READ_TIMEOUT. The corresponding defaults on the HBase side are as follows:

// Timeouts for communicating with DataNode for streaming writes/reads

public static int READ_TIMEOUT = 60 * 1000; // this is the value that was actually exceeded

public static int READ_TIMEOUT_EXTENSION = 3 * 1000;

public static int WRITE_TIMEOUT = 8 * 60 * 1000;

public static int WRITE_TIMEOUT_EXTENSION = 5 * 1000; //for write pipeline

Log:
11/10/12 10:50:44 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_8540857362443890085_4343699470java.net.SocketTimeoutException: 66000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.*.*.*:14707 remote=/*.*.*.24:80010]
Cause and fix:

The failure is therefore caused by these timeouts, so add the following configuration to hadoop-site.xml (hdfs-site.xml on current Hadoop releases):

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3000000</value>
</property>

<property>
  <name>dfs.socket.timeout</name>
  <value>3000000</value>
</property>
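
Both values are in milliseconds, so 3000000 corresponds to 50 minutes. After restarting the affected daemons, a quick way to confirm the values were picked up is sketched below, assuming a Hadoop release that ships the hdfs getconf subcommand:

# print the effective value of each key from the loaded configuration
hdfs getconf -confKey dfs.socket.timeout
hdfs getconf -confKey dfs.datanode.socket.write.timeout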

3. Datanodes may also fail to come up when the namespaceID stored on a datanode no longer matches the namenode's (typically after the namenode has been reformatted). Two workarounds for this are quoted below.

Workaround 1: Start from scratch

I can testify that the following steps solve this error, but the side effects won't make you happy (me neither). The crude workaround I have found is the following (a shell sketch comes after the steps):

1. Stop the cluster.

2. Delete the data directory on the problematic datanode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data.

3. Reformat the namenode (NOTE: all HDFS data is lost during this process!).

4. Restart the cluster.
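
On an old single-node installation laid out like the tutorial above, the four steps might look roughly like this; a sketch only, since script names and paths depend on the Hadoop version and installation, and the format step wipes all HDFS data:

# run as the Hadoop user; scripts live in $HADOOP_HOME/bin on old releases
stop-all.sh                                                # 1. stop the cluster
rm -rf /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data  # 2. delete the datanode's data directory
hadoop namenode -format                                    # 3. reformat the namenode (ALL HDFS DATA IS LOST)
start-all.sh                                               # 4. restart the cluster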

If deleting all the HDFS data and starting from scratch does not sound like a good idea (it might be OK during the initial setup/testing), you might give the second approach a try.

Workaround 2: Updating namespaceID of problematic datanodes

Big thanks to Jared Stehler for the following suggestion. I have not tested it myself yet, but feel free to try it out and send me your feedback. This workaround is "minimally invasive" as you only have to edit one file on the problematic datanodes:

1. Stop the datanode.

2. Edit the value of namespaceID in <dfs.data.dir>/current/VERSION to match the value of the current namenode.

3. Restart the datanode.

If you followed the instructions in my tutorials, the full path of the relevant file is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set hadoop.tmp.dir to /usr/local/hadoop-datastore/hadoop-hadoop).
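
A minimal sketch of step 2, assuming the namenode's metadata directory (dfs.name.dir) lives at /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name (that path is an assumption; read the namespaceID from your own namenode's VERSION file):

# namespaceID recorded by the namenode (the dfs.name.dir path here is an assumption)
NN_ID=$(grep '^namespaceID=' /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name/current/VERSION | cut -d= -f2)

# overwrite the datanode's namespaceID so it matches the namenode (stop the datanode first)
sed -i "s/^namespaceID=.*/namespaceID=$NN_ID/" /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION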

If you wonder what the contents of VERSION look like, here's one of mine:

#contents of <dfs.data.dir>/current/VERSION

namespaceID=393514426

storageID=DS-1706792599-10.10.10.1-50010-1204306713481

cTime=1215607609074

storageType=DATA_NODE

layoutVersion=-13


